@wade123
2019-03-26T08:51:58.000000Z
Python: Intro to Web Scraping and Data Analysis
In the previous scraping project we extracted text from a web page; this installment covers scraping images. We'll start with a single image, move on to a full page of images, and finish with multiple pages, working step by step from easy to hard.
This is the website we'll scrape: http://data.163.com (NetEase's data blog).
Open any article, pick an image, and Requests can download it in five lines of code.

import requests

url = 'http://cms-bucket.nosdn.127.net/2018/08/31/df39aac05e0b417a80487562cdf6ca40.png'
response = requests.get(url)
with open('北京房租地图.jpg', 'wb') as f:
    f.write(response.content)
Here response.content returns the image as binary data. If you're wondering why an image can be stored as binary data, see this explainer:
https://www.zhihu.com/question/36269548/answer/66734582
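To make the binary point concrete, here's a quick check (reusing the image URL from above) that what comes back really is raw bytes; PNG files even announce themselves with a fixed 8-byte signature:

import requests

url = 'http://cms-bucket.nosdn.127.net/2018/08/31/df39aac05e0b417a80487562cdf6ca40.png'
response = requests.get(url)
print(type(response.content))  # <class 'bytes'>
print(response.content[:8])    # b'\x89PNG\r\n\x1a\n', the PNG file signature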
Next, let's download every image in this article, more than 15 in all.
URL: http://data.163.com/18/0901/01/DQJ3D0D9000181IU.html
First, fetch the article's HTML source:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
url = 'http://data.163.com/18/0901/01/DQJ3D0D9000181IU.html'
response = requests.get(url, headers=headers)
if response.status_code == 200:
    # return response.text
    print(response.text)  # check that the page content was fetched: ok
Next, parse the source and extract all the image URLs from it.
In the earlier Maoyan box-office scraping project, we used several methods to pull information out of a page. Let's practice once more, extracting the URLs with each of: regular expressions, XPath, BeautifulSoup CSS selectors, BeautifulSoup's find_all, and PyQuery.
Locate the image URLs in the page: each one is the src attribute of an <img> node nested inside a <p> node and an <a> node.
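Piecing together what the five extractors below rely on, the relevant markup presumably looks something like the following. This is a hypothetical, abridged reconstruction (the real page has more attributes and surrounding nodes), but it is handy for testing each extractor offline:

# hypothetical sample of the page structure being targeted
sample_html = '''
<p>
  <a href="http://cms-bucket.nosdn.127.net/2018/08/31/425eca61322a4f99837988bb78a001ac.png">
    <img alt="房租" width="100%" src="http://cms-bucket.nosdn.127.net/2018/08/31/425eca61322a4f99837988bb78a001ac.png" style="max-width:100%">
  </a>
</p>
'''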

Method 1: regular expressions.

import re

def parse_page(html):
    pattern = re.compile('<p>.*?<img alt="房租".*?src="(.*?)".*?style', re.S)
    items = re.findall(pattern, html)
    # print(items)
    for item in items:
        yield {'url': item}
The extraction results look like this:
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/425eca61322a4f99837988bb78a001ac.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/df39aac05e0b417a80487562cdf6ca40.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/d6cb58a6bb014b8683b232f3c00f0e39.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/88d2e535765a4ed09e03877238647aa5.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/09/01/98d2f9579e9e49aeb76ad6155e8fc4ea.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/7410ed4041a94cab8f30e8de53aaaaa1.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/49a0c80a140b4f1aa03724654c5a39af.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/3070964278bf4637ba3d92b6bb771cea.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/812b7a51475246a9b57f467940626c5c.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/8bcbc7d180f74397addc74e47eaa1f63.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/e593efca849744489096a77aafd10d3e.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/7653feecbfd94758a8a0ff599915d435.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/edbaa24a17dc4cca9430761bfc557ffb.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/f768d440d9f14b8bb58e3c425345b97e.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/3430043fd305411782f43d3d8635d632.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/111ba73d11084c68b8db85cdd6d474a7.png'}
Method 2: XPath.

from lxml import etree

def parse_page(html):
    parse = etree.HTML(html)
    items = parse.xpath('//p//img[@alt="房租"]/@src')
    # print(items)  # test
    for item in items:
        yield {'url': item}
The output is the same as above.
Method 3: BeautifulSoup with CSS selectors.

from bs4 import BeautifulSoup

def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    items = soup.select('p > a > img')  # '>' selects a direct child node
    # print(items)
    for item in items:
        yield {'url': item['src']}
Method 4: BeautifulSoup's find_all.

from bs4 import BeautifulSoup

def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # match the <img> nodes whose width attribute is "100%"
    item = soup.find_all(name='img', width=['100%'])
    for i in range(len(item)):
        url = item[i].attrs['src']
        yield {'url': url}
Method 5: PyQuery.

from pyquery import PyQuery as pq

def parse_page(html):
    data = pq(html)
    data2 = data('p > a > img')
    # note: iterating with items() here differs from BeautifulSoup's CSS usage
    for item in data2.items():
        yield {'url': item.attr('src')}  # or equivalently: item.attr.src
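Whichever version of parse_page you pick, trying it out looks the same: feed it the html fetched earlier and print what the generator yields.

for item in parse_page(html):
    print(item['url'])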
The five methods above all extract the same image URLs, so any one of them will do. Here we'll go with method 4, BeautifulSoup's find_all, and move on to downloading the images.
import os
import requests

def save_pic(pic):  # pic is one dict like {'title': ..., 'pic': ..., 'num': ...}
    title = pic.get('title')
    url = pic.get('pic')
    # the image's number keeps the filenames in order
    num = pic.get('num')
    # create the folder if it does not exist yet
    if not os.path.exists(title):
        os.mkdir(title)
    # fetch the image itself
    response = requests.get(url, headers=headers)
    # numbered filenames make the images easy to browse in order
    file_path = os.path.join(title, '{}.jpg'.format(num))
    # write the image to disk
    with open(file_path, 'wb') as f:
        f.write(response.content)
    print('Image downloaded:', title)
The extracted results are dicts, so values can be read by key with the get method. With a folder created for the images and each one numbered, run the program and every image in the article is downloaded.
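As a quick illustration of why get is used (the values here are made up): it returns None for a missing key instead of raising the KeyError that pic['missing'] would.

# toy dict standing in for one parse_page result
pic = {'title': '房租', 'pic': 'http://cms-bucket.nosdn.127.net/2018/08/31/425eca61322a4f99837988bb78a001ac.png', 'num': 0}
print(pic.get('title'))    # 房租
print(pic.get('missing'))  # None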

Let's round the code out with exception handling, the article title, and image numbering. The complete program looks like this:
import os
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}

def get_page():
    # fetch one page
    url = 'http://data.163.com/18/0901/01/DQJ3D0D9000181IU.html'
    # wrap the request in exception handling
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
            # print(response.text)  # check that the page content was fetched
        return None
    except RequestException:
        print('Page request failed')
        return None

def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # get the title; a page has only one <h1> tag, so it is unique
    title = soup.h1.string
    item = soup.find_all(name='img', width=['100%'])
    # print(item)  # test
    for i in range(len(item)):
        pic = item[i].attrs['src']
        yield {
            'title': title,
            'pic': pic,
            'num': i  # number the images in order
        }

def save_pic(pic):
    title = pic.get('title')
    url = pic.get('pic')
    # the image number sets the filename order
    num = pic.get('num')
    if not os.path.exists(title):
        os.mkdir(title)
    try:
        # fetch the image itself
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # numbered filenames keep the images in order
            # (a hash such as md5(response.content).hexdigest() would also work)
            file_path = os.path.join(title, '{}.jpg'.format(num))
            if not os.path.exists(file_path):
                # download the image
                with open(file_path, 'wb') as f:
                    f.write(response.content)
                print('Image downloaded:', title)
            else:
                print('Image %s already downloaded' % title)
    except RequestException as e:
        print(e, 'failed to fetch the image')
        return None

def main():
    # get_page()  # test that the page is fetched: ok
    html = get_page()
    # parse_page(html)  # test that the page parses: ok
    data = parse_page(html)
    for pic in data:
        # print(pic)  # test the dict
        save_pic(pic)

# single process
if __name__ == '__main__':
    main()
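The closing comment notes that this runs as a single process. As a sketch (not from the original article), the downloads could be parallelized by swapping main for a multiprocessing.Pool version; the pool size of 4 is an arbitrary choice:

from multiprocessing import Pool

def main():
    html = get_page()
    if html:
        # materialize the generator, then hand the dicts to worker processes
        pics = list(parse_page(html))
        with Pool(4) as pool:
            pool.map(save_pic, pics)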
With that, we've scraped a single image and then a whole page of images.
The next article will practice scraping images across multiple pages.