@qinian 2019-04-11T22:06:53.000000Z

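Introduction to the urllib module

Web crawling

Almost everything a web crawler needs can be found in urllib. Learning this standard library first makes it easier to understand the more convenient requests library covered later.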

First, here is how the Python 2.x names map to Python 3.x:

In Python 2.x you write import urllib2; in Python 3.x the equivalent is import urllib.request, urllib.error

In Python 2.x you write import urllib; in Python 3.x the equivalent is import urllib.request, urllib.error, urllib.parse

In Python 2.x you write import urlparse; in Python 3.x the equivalent is import urllib.parse

In Python 2.x you use urllib2.urlopen (or urllib.urlopen); in Python 3.x the equivalent is urllib.request.urlopen

In Python 2.x you use urllib.urlencode; in Python 3.x the equivalent is urllib.parse.urlencode

In Python 2.x you use urllib.quote; in Python 3.x the equivalent is urllib.parse.quote

In Python 2.x you use cookielib.CookieJar; in Python 3.x the equivalent is http.cookiejar.CookieJar

In Python 2.x you use urllib2.Request; in Python 3.x the equivalent is urllib.request.Request
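As a quick check of the mapping above, the sketch below imports the Python 3 names and prints where each one lives. It assumes a Python 3 interpreter and makes no network requests. The dictionary keys are only labels for the old Python 2 names.

```python
# Python 3 locations of the names listed above.
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request

# Keys are the old Python 2 names (labels only); values are the Python 3 objects.
py3_names = {
    "urllib2.urlopen": urllib.request.urlopen,
    "urllib2.Request": urllib.request.Request,
    "urllib.urlencode": urllib.parse.urlencode,
    "urllib.quote": urllib.parse.quote,
    "urlparse.urlparse": urllib.parse.urlparse,
    "urllib2.URLError": urllib.error.URLError,
    "cookielib.CookieJar": http.cookiejar.CookieJar,
}

for py2_name, py3_obj in py3_names.items():
    print(f"{py2_name} -> {py3_obj.__module__}.{py3_obj.__qualname__}")
```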

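urllib is part of the Python standard library, so it needs no installation and can be used directly.
It provides the following features:

Web requests (urllib.request)

URL parsing (urllib.parse)

Proxy and cookie configuration

Exception handling (urllib.error)

robots.txt parsing (urllib.robotparser)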

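1. Web requests (the urllib.request module of the urllib package is used to make web requests)

1. urllib.request.urlopen (open a web page)

urlopen has three commonly used parameters:
1) r = urllib.request.urlopen(url, data, timeout)
2) url: the address, in the form scheme://host[:port]/path (for example url=r"http://www.baidu.com/"; prefix the string with r if it may contain escape characters)
3) data: extra data to send. It must be a bytes object, which you can produce with bytes() or str.encode(). When this argument is passed, the request is sent as POST rather than GET
4) timeout: the timeout, in seconds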
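The effect of the data parameter can be seen without sending anything. The sketch below builds two Request objects and asks each which HTTP method it would use. The payload wd=python is an arbitrary value chosen for illustration, and the actual urlopen call is left commented out because it needs network access.

```python
from urllib.request import Request

url = "http://www.baidu.com/"

# No data: the request would be sent as GET.
get_req = Request(url)

# data must be bytes; bytes(text, encoding=...) and text.encode() both work.
payload = bytes("wd=python", encoding="utf-8")
post_req = Request(url, data=payload)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# Sending it would look like this (not executed here, it needs network access):
# from urllib.request import urlopen
# with urlopen(post_req, timeout=5) as r:   # give up after 5 seconds
#     body = r.read()
```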

1.1 GET requests (urllib.request.urlopen) (a plain GET request that passes only the url)

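import urllib.request
r = urllib.request.urlopen('http://www.google.com.hk/')
first_line = r.readline()        # read the first line of the HTML page (bytes)
rest = r.read()                  # read everything after the first line (bytes)
raw = first_line + rest          # the complete response body
data = raw.decode('utf8')        # decode the bytes to a str
with open("./1.html", "wb") as f:  # save the page locally; binary mode needs bytes
    f.write(raw)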

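The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): these work just like the same methods on a file object
info(): returns an http.client.HTTPMessage object (httplib.HTTPMessage in Python 2) holding the headers sent back by the remote server
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found
geturl(): returns the URL that was actually retrieved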
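To try these methods without a network connection, the sketch below opens a data: URL, which urllib.request handles locally. The data: URL and its two-line body are stand-ins chosen for illustration. A real HTTP response is read the same way, and would also have a meaningful getcode().

```python
from urllib.parse import quote
from urllib.request import urlopen

# A data: URL carries its own body, so no server is contacted.
body = "first line\nsecond line\n"
url = "data:text/plain;charset=utf-8," + quote(body)

with urlopen(url) as r:
    first_line = r.readline()                    # one line, as bytes
    rest = r.read()                              # whatever has not been read yet
    content_type = r.info().get_content_type()   # from the response headers
    final_url = r.geturl()                       # the URL that was opened

print(first_line)
print(rest)
print(content_type)
```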

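urllib.parse.urlencode(query) encodes query parameters so that urlopen can handle them. Its argument is a dict of key/value pairs. To decode an encoded string, use urllib.parse.unquote().
A GET request with parameters then looks like this: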

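import urllib.request
import urllib.parse
# plain url: http://www.baidu.com
# url with a parameter: http://www.baidu.com/s?wd=北京
# non-ASCII characters in a url must be percent-encoded before the request is sent
wd = {"wd": "北京"}
url = "http://www.baidu.com/s?"
wdd = urllib.parse.urlencode(wd)  # encode the parameters
url = url + wdd
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)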
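The encoding step is easy to inspect in isolation. This short sketch shows what urlencode produces for the parameter used above and how unquote and parse_qs reverse it. parse_qs is not mentioned in the text and is included only as a convenient way to get a dict back.

```python
from urllib.parse import urlencode, quote, unquote, parse_qs

params = {"wd": "北京"}
query = urlencode(params)      # dict -> percent-encoded query string
print(query)                   # wd=%E5%8C%97%E4%BA%AC

url = "http://www.baidu.com/s?" + query
print(url)

print(quote("北京"))            # encode a single value: %E5%8C%97%E4%BA%AC
print(unquote(query))          # decode again: wd=北京
print(parse_qs(query))         # back to a dict of lists: {'wd': ['北京']}
```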

1.2 POST requests (urllib.request.Request(url, postdata) bundles the postdata together with the url)

Example code

import urllib.request
import urllib.parse
url = 'https://passport.cnblogs.com/user/signin?'
post = {
         'username': 'xxx',
         'password': 'xxxx'
         }
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)  # submit the form
r = urllib.request.urlopen(req)

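Operations such as registering or logging in send their information through a POST form.
To do this, inspect the page to find the form fields, build the form data as a dict (post), and encode it with urlencode(), which returns a string. That string is then encoded as UTF-8 because POST data must be bytes or a file object. Finally, pass postdata to a Request object and send it with urlopen().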
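The dict to str to bytes pipeline described above can be checked without submitting anything. In this sketch, build_form_request is a hypothetical helper name, not part of urllib, and the placeholder credentials are the ones from the example above.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

def build_form_request(url, fields):
    """Hypothetical helper: encode a dict of form fields and wrap it in a Request."""
    postdata = urlencode(fields).encode("utf-8")   # dict -> str -> bytes
    return Request(url, data=postdata)

req = build_form_request(
    "https://passport.cnblogs.com/user/signin?",
    {"username": "xxx", "password": "xxxx"},
)

print(req.get_method())                      # POST, because data was supplied
print(req.data)                              # the bytes that would be sent
print(parse_qs(req.data.decode("utf-8")))    # decoded again, for inspection
```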

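2. Advanced request features (besides POST data, a request can also carry request headers)

2.1 urllib.request.Request

urlopen() can send a basic request, but its few simple parameters are not enough to build a complete one. When a request needs headers (for example, to imitate a browser), use the more capable Request class to construct it.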

# Build a complete request with the Request class, adding headers and other details
import urllib.request
import urllib.parse
url = 'http://httpbin.org/post'
# Build the request headers as a dict
header = {
        'Host': 'httpbin.org',  # must be the host name of the target url, not a full url
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        # 'Accept-Encoding' is left out: urllib does not decompress gzip/deflate bodies, so decode() below would fail
        'Referer': 'http://www.baidu.com'  # tells the server which page the request came from
        }
form = {'name': 'Germey'}
# Encode the POST data first. data must be bytes; a dict is first encoded with urllib.parse.urlencode().
data = urllib.parse.urlencode(form).encode('utf-8')
request = urllib.request.Request(url=url, data=data, headers=header, method='POST')
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')
print(html)
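Before sending a request it can help to look at what the Request object actually holds. The sketch below builds a small request offline and reads its headers back. The shortened User-Agent value is an assumption made for brevity. Note that Request stores header names with only the first letter capitalized, so the lookup key is User-agent.

```python
from urllib.request import Request

req = Request(
    "http://httpbin.org/post",
    data=b"name=Germey",
    headers={"User-Agent": "Mozilla/5.0", "Referer": "http://www.baidu.com"},
    method="POST",
)
req.add_header("Accept", "*/*")      # headers can also be added afterwards

print(req.get_method())              # POST
print(req.has_header("User-agent"))  # True: names are stored as 'User-agent'
print(req.get_header("User-agent"))  # Mozilla/5.0
print(sorted(req.header_items()))    # all headers as (name, value) pairs
```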

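2.2 urllib.request.BaseHandler (cookies)

We can now build a Request, but how do we set up more advanced behavior such as cookie handling or proxies? This is where a more powerful tool, the Handler, comes in.
The basic urlopen() function does not support authentication, cookies, proxies, or other advanced HTTP features. To use them, create your own custom opener object with build_opener().
urllib.request.BaseHandler is the parent class of all other Handlers and provides the most basic Handler methods. Commonly used classes include:
- HTTPDefaultErrorHandler handles HTTP error responses; every error is raised as an HTTPError exception.
- HTTPRedirectHandler handles redirects.
- HTTPCookieProcessor handles cookies.
- ProxyHandler configures proxies; if no proxies are passed in, the settings are read from the environment.
- HTTPPasswordMgr manages passwords; it maintains a table of user names and passwords.
- HTTPBasicAuthHandler manages authentication; if opening a link requires authentication, it can handle that.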
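The handlers listed above are combined by passing them to build_opener(). The following sketch builds an opener with a cookie handler and a proxy handler and lists the handlers it ends up with. The proxy address 127.0.0.1:8080 is an assumed placeholder, and nothing is sent over the network.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
# Assumed proxy address, for illustration only.
proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})

# build_opener adds the default handlers (redirects, HTTP errors, ...) itself.
opener = urllib.request.build_opener(cookie_handler, proxy_handler)

handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
```

Requests made with opener.open(url) would then go through every handler in that list.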

2.3 Using cookies

Capturing cookies in a variable

import http.cookiejar
import urllib.request
# create a CookieJar object with http.cookiejar.CookieJar()
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
# HTTPCookieProcessor creates the cookie handler, which is then used to build the opener
opener = urllib.request.build_opener(handler)
# install the opener globally
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://www.baidu.com')
# response = opener.open('http://www.baidu.com')
for item in cookie:
    print('Name = ' + item.name)
    print('Value = ' + item.value)

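First declare a CookieJar object, then use HTTPCookieProcessor to build a handler, and finally build an opener with build_opener and call open() (or urlopen(), once the opener is installed globally).
The loop at the end prints each cookie in the jar.

Saving cookies to a local file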

import http.cookiejar
import urllib.request
# file to save the cookies to: cookie.txt in the current directory
filename = 'cookie.txt'
# create a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = http.cookiejar.MozillaCookieJar(filename)
# use HTTPCookieProcessor from urllib.request to create the cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build the opener from the handler
opener = urllib.request.build_opener(handler)
# send a request; this works the same way as urlopen
response = opener.open("http://www.baidu.com")
# save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

Loading cookies from a file and making a request

import http.cookiejar
import urllib.request
# create a MozillaCookieJar instance
cookie = http.cookiejar.MozillaCookieJar()
# load the cookies from the file into the variable
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# create the request
req = urllib.request.Request("http://www.baidu.com")
# use build_opener from urllib.request to create an opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print(response.read())
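The save and load steps can be tried end to end without contacting a server. In the sketch below the cookie is built by hand with a hypothetical make_cookie helper, and the name session, the value abc123, and the domain www.example.com are all made up for illustration. In real use the cookie would come from a server response, as in the examples above.

```python
import http.cookiejar
import os
import tempfile
from urllib.request import Request

def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand instead of receiving it from a server."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "cookie.txt")

    # Save: put one cookie in the jar and write it to the file.
    saved = http.cookiejar.MozillaCookieJar(filename)
    saved.set_cookie(make_cookie("session", "abc123", "www.example.com"))
    saved.save(ignore_discard=True, ignore_expires=True)

    # Load: read the file back into a fresh jar.
    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(filename, ignore_discard=True, ignore_expires=True)

loaded_pairs = [(c.name, c.value) for c in loaded]
print(loaded_pairs)

# The jar attaches matching cookies to a request; an opener built with
# HTTPCookieProcessor(loaded) performs this step automatically.
req = Request("http://www.example.com/")
loaded.add_cookie_header(req)
print(req.get_header("Cookie"))
```

ignore_discard=True matters here: a cookie without an expiry time is a session cookie, and it would otherwise be skipped on both save and load.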

Proxy server setup (note: when making many calls, install the opener globally)

import urllib.request
def use_proxy(proxy_addr, url):
    # build the proxy handler
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    # build the opener object
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # to install it globally instead:
    # urllib.request.install_opener(opener)
    # data = urllib.request.urlopen(url).read().decode('utf8')  # open through the global opener
    data = opener.open(url).read().decode('utf8')  # or open directly through the opener object
    return data
proxy_addr = '61.163.39.70:9999'
data = use_proxy(proxy_addr, 'http://www.baidu.com')
print(len(data))
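## Exception handling and log output

Here opener is normally an opener object created by build_opener().

install_opener(opener) installs opener as the global URL opener used by urlopen().

Exception handling

The structure of exception handling is as follows: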

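try:
    # the code to run
    print(...)
except Exception:
    # what to run if the code in the try block raises an exception
    print(...)
else:
    # runs only if the code in the try block did not raise an exception
    print(...)
finally:
    # the code in finally always runs, whatever happened
    print(...)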
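To see which clauses run in each case, here is a small sketch that records the order of execution. The function run and the division it performs are purely illustrative.

```python
def run(divisor):
    """Record which clauses execute for a given divisor."""
    steps = []
    try:
        steps.append("try")
        10 / divisor                 # raises ZeroDivisionError when divisor is 0
    except ZeroDivisionError:
        steps.append("except")
    else:
        steps.append("else")
    finally:
        steps.append("finally")
    return steps

print(run(2))  # ['try', 'else', 'finally']
print(run(0))  # ['try', 'except', 'finally']
```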

Causes of URLError:

1. No network connection (the machine cannot reach the Internet)

from urllib import request, error
try:
    r=request.urlopen('http://www.baidu.com')
except error.URLError as e:
    print(e.reason)
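A URLError can also be triggered without unplugging the network. The sketch below wraps urlopen in a hypothetical try_open helper and feeds it a URL with a made-up scheme, which urllib rejects locally with a URLError. The helper name, the 5-second timeout, and the nosuchscheme URL are all assumptions for illustration.

```python
from urllib import request, error

def try_open(url, timeout=5):
    """Hypothetical helper: return (ok, detail) instead of letting URLError propagate."""
    try:
        with request.urlopen(url, timeout=timeout) as r:
            return True, r.geturl()
    except error.URLError as e:
        return False, str(e.reason)

# An unknown scheme fails before any connection is attempted.
ok, detail = try_open("nosuchscheme://example.invalid/")
print(ok, detail)
```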

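2. The requested page does not exist (HTTPError)

The client sends a request to the server. If the requested resource is obtained successfully, the status code is 200, meaning the response succeeded. If the requested resource does not exist, the server usually returns a 404 error.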

from urllib import request, error
try:
    response = request.urlopen('http://www.baodu.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
else:
    print("Request Successfully")
# Catch URLError (the parent class of HTTPError) and use hasattr to check which attributes the exception has
from urllib import request, error
try:
    response = request.urlopen('http://blog.csdn.ne')
except error.URLError as e:
    if hasattr(e, 'code'):
        # only HTTPError carries a status code
        print('the server couldn\'t fulfill the request')
        print('Error code:', e.code)
    elif hasattr(e, 'reason'):
        print('we failed to reach a server')
        print('Reason:', e.reason)
else:
    print('no exception was raised')
    # everything is ok
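HTTPError is a subclass of URLError, which is why the order of the checks matters. The sketch below builds both exceptions by hand so that it runs offline. describe is a hypothetical helper, and the URL, status, and messages are made-up values.

```python
from urllib import error

def describe(exc):
    """Hypothetical helper: summarize a urllib exception."""
    if isinstance(exc, error.HTTPError):      # check the subclass first
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):
        return f"failed to reach server: {exc.reason}"
    return f"unexpected error: {exc!r}"

# Exceptions constructed by hand so that no network access is needed.
not_found = error.HTTPError("http://example.com/missing", 404, "Not Found", None, None)
unreachable = error.URLError("Name or service not known")

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(not_found))
print(describe(unreachable))
```

With try/except the same rule applies: put except error.HTTPError before except error.URLError, or the more general clause will catch both.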