@wade123
2019-03-26T08:51:58.000000Z
Python Web Scraping and Data Analysis for Beginners
In the previous scraping project we extracted the text of a web page; this article covers how to scrape the images on a page. We'll start by downloading a single image, then every image on one page, and finally images across multiple pages, working step by step from easy to hard.
The site we'll crawl is NetEase's data.163.com. Open any article, pick an image, and Requests can download it in just 5 lines of code.
import requests

url = 'http://cms-bucket.nosdn.127.net/2018/08/31/df39aac05e0b417a80487562cdf6ca40.png'
response = requests.get(url)
# write the image's binary data to a local file
with open('北京房租地图.jpg', 'wb') as f:
    f.write(response.content)
Here response.content is used to fetch the image as binary data. If you want to understand why an image can be stored as binary data, this answer explains it:
https://www.zhihu.com/question/36269548/answer/66734582
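As a quick sanity check, and purely as a sketch reusing the image URL from above, you can confirm that response.text is a decoded string while response.content is raw bytes, and that a PNG file always begins with the same 8-byte signature:

import requests

url = 'http://cms-bucket.nosdn.127.net/2018/08/31/df39aac05e0b417a80487562cdf6ca40.png'
response = requests.get(url)
# response.text is a decoded str, response.content is the raw bytes of the file
print(type(response.text), type(response.content))
# every PNG file begins with the fixed signature b'\x89PNG\r\n\x1a\n'
print(response.content[:8] == b'\x89PNG\r\n\x1a\n')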
Next, let's download all the images in this article, more than 15 in total.
URL: http://data.163.com/18/0901/01/DQJ3D0D9000181IU.html
First, fetch the page's HTML source:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
url = 'http://data.163.com/18/0901/01/DQJ3D0D9000181IU.html'
response = requests.get(url, headers=headers)
if response.status_code == 200:
    # return response.text
    print(response.text)  # check that the page content was fetched successfully
Next, parse the source code and extract every image URL from it.
In the earlier Maoyan box-office scraping project we tried several ways of extracting information from a page. Let's practice them again, this time with regular expressions, XPath, BeautifulSoup CSS selectors, BeautifulSoup's find_all, and PyQuery.
In the page source, each image URL sits in the src attribute of an <img> node nested as <p> → <a> → <img>.
Regular expressions (each of the following snippets runs inside a generator function, with html holding the page source fetched above):

import re

pattern = re.compile('<p>.*?<img alt="房租".*?src="(.*?)".*?style', re.S)
items = re.findall(pattern, html)
# print(items)
for item in items:
    yield {
        'url': item
    }
The extracted results look like this:
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/425eca61322a4f99837988bb78a001ac.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/df39aac05e0b417a80487562cdf6ca40.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/d6cb58a6bb014b8683b232f3c00f0e39.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/88d2e535765a4ed09e03877238647aa5.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/09/01/98d2f9579e9e49aeb76ad6155e8fc4ea.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/7410ed4041a94cab8f30e8de53aaaaa1.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/49a0c80a140b4f1aa03724654c5a39af.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/3070964278bf4637ba3d92b6bb771cea.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/812b7a51475246a9b57f467940626c5c.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/8bcbc7d180f74397addc74e47eaa1f63.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/e593efca849744489096a77aafd10d3e.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/7653feecbfd94758a8a0ff599915d435.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/edbaa24a17dc4cca9430761bfc557ffb.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/f768d440d9f14b8bb58e3c425345b97e.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/3430043fd305411782f43d3d8635d632.png'}
{'url': 'http://cms-bucket.nosdn.127.net/2018/08/31/111ba73d11084c68b8db85cdd6d474a7.png'}
XPath:

from lxml import etree

parse = etree.HTML(html)
items = parse.xpath('*//p//img[@alt = "房租"]/@src')
print(items)
for item in items:
    yield {
        'url': item
    }
The result is the same as above.
BeautifulSoup CSS selectors:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
items = soup.select('p > a > img')  # '>' selects a direct child node
# print(items)
for item in items:
    yield {
        'url': item['src']
    }
BeautifulSoup find_all:

soup = BeautifulSoup(html, 'lxml')
# every article image is an <img> node whose width attribute is "100%"
item = soup.find_all(name='img', width=['100%'])
for i in range(len(item)):
    url = item[i].attrs['src']
    yield {
        'url': url
    }
PyQuery:

from pyquery import PyQuery as pq

data = pq(html)
data2 = data('p > a > img')
# print(data2)
for item in data2.items():  # note: unlike BeautifulSoup's select(), PyQuery iterates via .items()
    yield {
        'url': item.attr('src')
        # or: 'url': item.attr.src
    }
Any of these five methods extracts the image URLs, so pick whichever you like; here we go with the fourth one (BeautifulSoup's find_all) and move on to downloading the images.
# body of the save_pic(pic) function (shown in full below)
title = pic.get('title')
url = pic.get('pic')
# image number, used to keep the pictures in order
num = pic.get('num')
# create a folder named after the article title
if not os.path.exists(title):
    os.mkdir(title)
# request the image itself
response = requests.get(url, headers=headers)
# build the path the image will be saved to;
# the file name uses the number so the images stay in order
file_path = os.path.join(title, '{}.jpg'.format(num))
# download the image
with open(file_path, 'wb') as f:
    f.write(response.content)
print('Image downloaded:', title)
Each extracted result is a dictionary, so the values are read with the get method. We create a folder for the images and number them; run the program and every image in the article is downloaded.
Let's polish the code a little by adding exception handling, the article title, and image numbering. The complete code is as follows:
import requests
from bs4 import BeautifulSoup
import re
import os
from hashlib import md5
from requests.exceptions import RequestException
from multiprocessing import Pool
from urllib.parse import urlencode

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}

def get_page():
    # download one page
    url = 'http://data.163.com/18/0901/01/DQJ3D0D9000181IU.html'
    # wrap the request in exception handling
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
            # print(response.text)  # check that the page content was fetched successfully
    except RequestException:
        print('Page request failed')
        return None

def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # get the title; a page has only one <h1> tag, so it is unique
    title = soup.h1.string
    # every article image is an <img> node whose width attribute is "100%"
    item = soup.find_all(name='img', width=['100%'])
    # print(item)  # test
    for i in range(len(item)):
        pic = item[i].attrs['src']
        yield {
            'title': title,
            'pic': pic,
            'num': i  # number the images to keep them in order
        }

def save_pic(pic):
    title = pic.get('title')
    url = pic.get('pic')
    # image number, used to keep the pictures in order
    num = pic.get('num')
    # create a folder named after the article title
    if not os.path.exists(title):
        os.mkdir(title)
    try:
        # request the image itself
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # build the path the image will be saved to;
            # the file name uses the number so the images stay in order,
            # rather than a hash such as md5(response.content).hexdigest()
            file_path = os.path.join(title, '{}.jpg'.format(num))
            if not os.path.exists(file_path):
                # download the image
                with open(file_path, 'wb') as f:
                    f.write(response.content)
                print('Image downloaded:', title)
            else:
                print('Image %s already downloaded' % title)
    except RequestException as e:
        print(e, 'Failed to fetch the image')
        return None

def main():
    # get_page()  # check that the page content was fetched successfully
    html = get_page()
    # parse_page(html)  # check that the page was parsed successfully
    data = parse_page(html)
    for pic in data:
        # print(pic)  # inspect each dict
        save_pic(pic)

# single process
if __name__ == '__main__':
    main()
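The code above runs everything in a single process, so the multiprocessing Pool import goes unused. Purely as a sketch that is not part of the original tutorial, main() could be swapped for a version that hands the parsed results to a small pool of worker processes (the pool size of 4 is an arbitrary choice):

def main():
    html = get_page()
    # materialize the generator so the dicts can be handed to worker processes
    data = list(parse_page(html))
    pool = Pool(4)  # 4 worker processes, an arbitrary choice
    pool.map(save_pic, data)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Because save_pic only needs the picture dict, it maps cleanly onto Pool.map once the generator has been turned into a list.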
So far we have scraped a single image and then every image on one page.
In the next article we'll practice scraping images across multiple pages.