Scraping the avatars of every user who commented on a Douban Discovery article in 2016
Last time I wrote a simple image scraper; this time I tried something a bit more complex.
I noticed that the "view more" button on Douban Discovery sends a GET request whose parameter is a date, so I decided to scrape the avatars of every user who left a comment during the year.
In essence it just extracts URLs with regular expressions over and over, and finally downloads the avatars.
I still haven't handled the 404 you get when requesting a date that doesn't exist, such as February 30th; for now I'm lazily hard-coding range(1, 29) and will fix that properly later.
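As a minimal sketch of one way around the February 30th problem: `datetime.date` raises `ValueError` for calendar days that don't exist, so invalid dates can simply be skipped instead of capping every month at 28 days. (The URL format is the one used in the script below; everything else here is an illustrative assumption, not the code I actually ran.)

```python
import datetime

def valid_days(year):
    """Yield (month, day) pairs for every real calendar day of the year."""
    for month in range(1, 13):
        for day in range(1, 32):
            try:
                datetime.date(year, month, day)  # raises ValueError for e.g. Feb 30
            except ValueError:
                continue
            yield month, day

# Build one request URL per real day of 2016 (a leap year, so 366 of them).
urls = ["https://www.douban.com/j/explore/rec_feed?day=2016-%d-%d" % (m, d)
        for m, d in valid_days(2016)]
print(len(urls))
```

This covers the days that range(1, 29) skips (the 29th, 30th, and 31st) without ever producing a 404-triggering date.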
Once the new term starts I plan to try scraping my CET-4 score and look into simulated login.
```python
import urllib.request, os, re

HEADER = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

def openurl(url):
    req = urllib.request.Request(url=url, headers=HEADER)
    res = urllib.request.urlopen(req)
    page = res.read()
    down(page)

def find_1(url):  # extract all article links from the feed page
    req = urllib.request.Request(url=url, headers=HEADER)
    res = urllib.request.urlopen(req)
    page = str(res.read()).replace("\\", "")  # strip escaped slashes in the JSON response
    for link, t in set(re.findall(r'(https[\S]*?(note/)\d{9}/)', page)):
        openurl(link)  # scrape the current (first) page of comments
        find_2(link)   # then follow the "more comments" page

def find_2(url):  # scrape the "more comments" page
    req = urllib.request.Request(url=url, headers=HEADER)
    res = urllib.request.urlopen(req)
    page = res.read()
    for link, t in set(re.findall(r'(https[\S]*?(comments))', str(page))):
        openurl(link)

def map(path):  # build the local save path for an image (note: shadows the built-in map)
    download = "D:\\test"
    if not os.path.isdir(download):  # check whether the directory exists
        os.mkdir(download)           # create it if it doesn't
    flag = path.rindex('/')          # find the last '/'
    return os.path.join(download, path[flag+1:])

def down(page):
    page = str(page)
    for link, t in set(re.findall(r'(https[\S]*?icon[\S]*?(jpg|png|gif))', page)):  # avatar links
        print(link)
        try:
            urllib.request.urlretrieve(link, map(link))
        except Exception:
            print("failed")

for a in range(1, 13):      # months 1-12 (range(1, 12) would miss December)
    for b in range(1, 29):  # lazy workaround: stop at day 28 to avoid 404s on invalid dates
        url = "https://www.douban.com/j/explore/rec_feed?day=2016-%d-%d" % (a, b)  # build the GET request
        find_1(url)
```