First study note: a very simple Python crawler that grabs the images on a single page, written in Python 3.6.
www.douban.com is used as the example here.
The main idea is to pull the image URLs out of the page source with a regular expression (and to forge the User-Agent header along the way).
It is a very simple thing, but it still took a fair amount of effort, and it pushed open the door to web scraping for me.
Honestly, the main reason it took so long is that there is so little material online for Python 3, so I took quite a few detours; then again, a few extra detours are not a bad thing.
import urllib.request, os, re

def urlopen(url):
    # Forge the User-Agent so the request looks like an ordinary browser.
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=header)
    res = urllib.request.urlopen(req)
    page = res.read().decode('utf-8', errors='ignore')  # decode bytes so the regex works on text
    down(page)

def map(path):  # build the local save path for an image
    download = "D:\\test"
    if not os.path.isdir(download):  # create the directory if it does not exist
        os.mkdir(download)
    flag = path.rindex('/')  # position of the last '/', i.e. the start of the file name
    return os.path.join(download, path[flag + 1:])

def down(page):
    # Two capturing groups, so findall yields (url, extension) tuples; set() removes duplicates.
    for link, ext in set(re.findall(r'(https?[\S]*?(jpg|png|gif))', page)):
        print(link)
        try:
            urllib.request.urlretrieve(link, map(link))
        except Exception:
            print("download failed")

url = "https://www.douban.com"
urlopen(url)
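A quick aside on the regex: because the pattern has two capturing groups, re.findall returns (url, extension) tuples, which is why the loop unpacks two values and only downloads the first one. A minimal sketch of that behaviour, run against a made-up snippet of HTML (the sample string and domain below are invented purely for illustration):

import re

# Made-up fragment of page source, only to show what the regex returns.
sample = '<img src="https://img1.example.com/pic/a.jpg"> <img src="https://img1.example.com/pic/b.png">'

matches = re.findall(r'(https?[\S]*?(jpg|png|gif))', sample)
print(matches)
# [('https://img1.example.com/pic/a.jpg', 'jpg'), ('https://img1.example.com/pic/b.png', 'png')]

# set() deduplicates repeated links; only the full URL (the first element) is needed for downloading.
for link, ext in set(matches):
    print(link)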
Keep working at it; the road ahead is long.