Python Study Notes (2)

Scraping the avatars of every user who commented on an article in Douban Explore (豆瓣发现) in 2016
Last time I wrote a simple image scraper; this time I tried something a little more complicated.
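I noticed that the "load more" button on Douban Explore submits a GET request whose parameter is a date, so I decided to scrape the avatars of all users who commented over the course of a year.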
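In practice this just means extracting URLs with regular expressions several times in a row, until the avatar images can finally be downloaded.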
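The regex step can be tried offline first. The following is only a minimal sketch: the `sample` string is a made-up stand-in for a fetched page (the real Douban markup may look different), and `NOTE_RE` and `extract_note_links` are hypothetical names used for illustration.

```python
import re

# Made-up sample text standing in for a fetched page (an assumption, not real Douban output).
sample = (
    '<a href="https://www.douban.com/note/123456789/">one</a>'
    '<a href="https://www.douban.com/note/987654321/">two</a>'
    '<a href="https://www.douban.com/note/123456789/">one again</a>'
)

# The non-greedy "[\S]*?" stops at the first "note/" that is followed by nine digits and "/".
NOTE_RE = re.compile(r'https[\S]*?note/\d{9}/')

def extract_note_links(text):
    """Return the unique note URLs found in text, sorted for stable output."""
    return sorted(set(NOTE_RE.findall(text)))

links = extract_note_links(sample)
for link in links:
    print(link)
```

Wrapping the matches in `set` removes duplicates, so each article is visited only once even when the page links to it several times.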
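I have not yet solved the problem that requesting a nonexistent date such as February 30 returns a 404. For now I am being lazy and simply using range(1,29); I will deal with this later.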
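One possible way around the nonexistent-date problem is to ask the standard library how many days each month has, so that only real dates are requested. This is a sketch, not a tested fix against the live site: `BASE` and `feed_urls` are hypothetical names, and the URL pattern (including the unpadded month and day) is simply copied from the script in this post.

```python
import calendar

# URL pattern copied from the script; unpadded month/day is an assumption carried over from it.
BASE = "https://www.douban.com/j/explore/rec_feed?day=%d-%d-%d"

def feed_urls(year):
    """Yield one feed URL for every real calendar day of the given year."""
    for month in range(1, 13):
        # monthrange returns (weekday of the first day, number of days in the month)
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            yield BASE % (year, month, day)

urls = list(feed_urls(2016))
print(len(urls))   # 366, since 2016 is a leap year
print(urls[0])
print(urls[-1])
```

With this generator the two nested `range` loops collapse into a single `for url in feed_urls(2016)` loop, and February 29 through December 31 are covered as well.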
Once the semester starts I plan to try scraping CET-4 scores and to look into simulated login.

import urllib.request, os, re

# Browser-like User-Agent sent with every request
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

def openurl(url):  # fetch a page and download every avatar found on it
    req = urllib.request.Request(url=url, headers=header)
    res = urllib.request.urlopen(req)
    page = res.read()
    down(page)

def find_1(url):  # collect every article linked from the feed for one day
    req = urllib.request.Request(url=url, headers=header)
    res = urllib.request.urlopen(req)
    page = res.read()
    page = str(page).replace("\\", "")  # drop all backslashes so escaped URLs match the pattern below
    for link, t in set(re.findall(r'(https[\S]*?(note/)\d{9}/)', page)):
        openurl(link)  # scrape the current page (the first page of comments)
        find_2(link)

def find_2(url):  # follow the links to the remaining comment pages
    req = urllib.request.Request(url=url, headers=header)
    res = urllib.request.urlopen(req)
    page = res.read()
    for link, t in set(re.findall(r'(https[\S]*?(comments))', str(page))):
        openurl(link)

def map(path):  # build the local save path for an image (note: this name shadows the built-in map)
    download = r"D:\test"  # raw string; in a normal string "\t" would be read as a tab
    if not os.path.isdir(download):  # check whether the directory exists
        os.mkdir(download)  # create it if it does not
    flag = path.rindex('/')  # position of the last "/"
    t = os.path.join(download, path[flag+1:])
    return t

def down(page):
    page = str(page)
    for link, t in set(re.findall(r'(https[\S]*?icon[\S]*?(jpg|png|gif))', page)):  # extract the avatar links
        print(link)
        try:
            urllib.request.urlretrieve(link, map(link))
        except Exception:
            print("failed")

for a in range(1, 13):  # months 1 to 12
    for b in range(1, 29):  # days 1 to 28 only, to avoid requesting nonexistent dates
        url = "https://www.douban.com/j/explore/rec_feed?day=2016-%d-%d" % (a, b)  # build the GET request
        find_1(url)
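One way to keep the crawl running when a date such as February 30 returns 404 would be to catch `urllib.error.HTTPError` and skip that day. The sketch below is an assumption about how this could be wired in, not something the script does: `fetch_or_none` and the two `fake_*` openers are hypothetical helpers, and the demonstration runs offline against those stand-ins rather than against Douban.

```python
import io
import urllib.error
import urllib.request

def fetch_or_none(url, opener=urllib.request.urlopen):
    """Return the response body, or None if the server answers 404."""
    try:
        with opener(url) as res:
            return res.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # nonexistent date: skip it
        raise            # any other HTTP error is still reported

# Offline stand-ins for urlopen (hypothetical, no network access needed).
def fake_404(url):
    raise urllib.error.HTTPError(url, 404, "Not Found", {}, io.BytesIO(b""))

def fake_ok(url):
    return io.BytesIO(b"page body")

print(fetch_or_none("https://example.invalid/feed?day=2016-2-30", fake_404))
print(fetch_or_none("https://example.invalid/feed?day=2016-2-28", fake_ok))
```

Because `urllib.request.urlopen` also accepts a `Request` object, the same helper could be called with a request that carries the script's User-Agent header; a `None` result would then simply mean "nothing to scrape for this day".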