Python Study Notes (1)

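This is my first study note: a very simple Python crawler that grabs the images from a single page, written with Python 3.6.
The example here uses www.douban.com.
The main idea is to pull image URLs out of the page source with a regular expression (and, while at it, to spoof the User-Agent).
It is a very simple thing, but it still took a lot of effort, and it opened the door to web crawling for me.
Honestly, much of the time went into this because there is so little Python 3 material online, so I took quite a few detours. A few extra detours are no bad thing, though.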
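
The regex idea can be tried on a small string before touching the network. The sketch below is only an illustration: `SAMPLE_HTML` is a made-up fragment, `find_image_urls` is a hypothetical helper name, and the pattern is one possible variant that accepts both http and https links.

```python
import re

# Hypothetical HTML fragment for illustration; not taken from any real page.
SAMPLE_HTML = '''
<img src="https://img.example.com/a/cover.jpg" alt="cover">
<img src="http://img.example.com/b/icon.png">
<a href="https://example.com/about">about</a>
<img src="https://img.example.com/a/cover.jpg">
'''

# "https?" matches http and https; "\S*?" lazily consumes non-space
# characters up to the first ".jpg", ".png" or ".gif".
IMG_URL = re.compile(r'https?://\S*?\.(?:jpg|png|gif)')


def find_image_urls(html):
    """Return the distinct image URLs found in html, sorted for stable output."""
    return sorted(set(IMG_URL.findall(html)))


urls = find_image_urls(SAMPLE_HTML)
for u in urls:
    print(u)
```

The `set` removes the repeated cover image, and the plain link to the "about" page is skipped because it never reaches an image extension before the next whitespace.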

import os
import re
import urllib.request


def urlopen(url):
    # Spoof a browser User-Agent for the page request
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=header)
    res = urllib.request.urlopen(req)
    page = res.read()
    down(page)


def map(path):  # Set the path where the image is saved
    download = "D:\\test"
    if not os.path.isdir(download):  # Check whether the directory exists
        os.mkdir(download)  # Create the directory if it does not exist
    flag = path.rindex('/')  # Find the last "/"
    t = os.path.join(download, path[flag + 1:])
    return t


def down(page):
    # Decode the raw bytes into text before searching
    html = page.decode('utf-8', errors='ignore')
    # "https?" matches both http and https links
    for link, t in set(re.findall(r'(https?\S*?\.(jpg|png|gif))', html)):
        print(link)
        try:
            urllib.request.urlretrieve(link, map(link))
        except Exception:
            print("Failed")


url = "https://www.douban.com"
urlopen(url)
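
One detail worth noting: `urllib.request.urlretrieve` takes no headers argument, so the spoofed User-Agent in the script only covers the page request, not the image downloads. A possible way around this, sketched under assumptions, is to install a global opener that carries the header. The names `install_ua_opener` and `local_name` are hypothetical helpers, and `img.example.com` is a placeholder host.

```python
import os
import urllib.request
from urllib.parse import urlsplit

UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0"


def install_ua_opener(user_agent=UA):
    """Install a global opener so urlopen() and urlretrieve() send user_agent."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    urllib.request.install_opener(opener)
    return opener


def local_name(link, folder="D:\\test"):
    """Build a save path from the last segment of the URL path.

    urlsplit() separates out any query string, so "?size=large" does not
    end up in the file name.
    """
    name = os.path.basename(urlsplit(link).path)
    return os.path.join(folder, name)


opener = install_ua_opener()
print(opener.addheaders)
print(local_name("https://img.example.com/a/cover.jpg?size=large"))
```

After `install_ua_opener()` has run, later calls to `urllib.request.urlretrieve(link, local_name(link))` go through the installed opener and send the same User-Agent as the page request.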

Keep at it; there is a long road ahead.