Scraping the avatars of every user who commented on Douban Explore articles in 2016
Last time I wrote a simple image scraper; this time I tried something slightly more complex.
I noticed that the "load more" button on Douban Explore submits a GET request whose parameter is a date, so I decided to scrape the avatars of all users who commented on articles over the whole year.
Essentially it is just several rounds of extracting URLs with regular expressions, which in the end lets me download the avatars.
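To make the regex chain concrete, here is a small standalone demo of the two extraction steps on a made-up snippet of page source (the markup below is my own illustration, not real Douban output):

```python
import re

# Hypothetical page fragment (not actual Douban markup)
page = ('<a href="https://www.douban.com/note/123456789/">note</a> '
        '<img src="https://img1.doubanio.com/icon/u12345.jpg">')

# Step 1: extract article (note) links -- each note URL ends in a 9-digit id
notes = re.findall(r'https[\S]*?note/\d{9}/', page)
print(notes)   # ['https://www.douban.com/note/123456789/']

# Step 2: extract avatar image links
icons = re.findall(r'https[\S]*?icon[\S]*?\.(?:jpg|png|gif)', page)
print(icons)   # ['https://img1.doubanio.com/icon/u12345.jpg']
```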
I haven't yet solved the problem of nonexistent dates like February 30 returning a 404, so for now I lazily hard-code range(1, 29); I'll deal with it later.
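One way the invalid-date problem could eventually be fixed is to enumerate only real dates with the standard library's calendar module; this is my own sketch, not part of the script below:

```python
import calendar

def dates_of_2016():
    """Yield every valid (month, day) pair in 2016."""
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        _, last_day = calendar.monthrange(2016, month)
        for day in range(1, last_day + 1):
            yield month, day

print(sum(1 for _ in dates_of_2016()))  # 2016 was a leap year: 366 days
```

Generating the request URLs from this iterator would cover days 29-31 where they exist and skip them where they don't.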
When the new term starts I plan to try scraping my CET-4 score and look into simulated login.
```python
import urllib.request, os, re

HEADER = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

def openurl(url):
    req = urllib.request.Request(url=url, headers=HEADER)
    res = urllib.request.urlopen(req)
    page = res.read()
    down(page)

def find_1(url):  # scrape all article links from the feed page
    req = urllib.request.Request(url=url, headers=HEADER)
    res = urllib.request.urlopen(req)
    page = str(res.read()).replace("\\", "")  # strip JSON escapes so URLs match
    for link, t in set(re.findall(r'(https[\S]*?(note)/\d{9}/)', page)):
        openurl(link)   # scrape the current (first) page of comments
        find_2(link)

def find_2(url):  # scrape the "more comments" pages
    req = urllib.request.Request(url=url, headers=HEADER)
    res = urllib.request.urlopen(req)
    page = res.read()
    for link, t in set(re.findall(r'(https[\S]*?(comments))', str(page))):
        openurl(link)

def save_path(path):  # build the local save path for an image
    download = r"D:\test"
    if not os.path.isdir(download):  # check whether the directory exists
        os.mkdir(download)           # create it if it doesn't
    flag = path.rindex('/')          # find the last '/'
    return os.path.join(download, path[flag + 1:])

def down(page):
    page = str(page)
    for link, t in set(re.findall(r'(https[\S]*?icon[\S]*?\.(jpg|png|gif))', page)):  # avatar links
        print(link)
        try:
            urllib.request.urlretrieve(link, save_path(link))
        except Exception:
            print("failed")

for a in range(1, 13):       # months 1-12
    for b in range(1, 29):   # crude: skip days 29-31 to dodge 404s on invalid dates
        url = "https://www.douban.com/j/explore/rec_feed?day=2016-%d-%d" % (a, b)  # build the GET request
        find_1(url)
```