The urllib2 Module in Python

Posted by koflover on 2018-8-5 13:45:01

Python's urllib2 module makes it easy to simulate a user visiting a web page.
This post is a brief record of what I learned while studying it.
I. The urlopen function
    urlopen(url, data=None) -- Basic usage is the same as original
    urllib. pass the url and optionally data to post to an HTTP URL, and
    get a file-like object back. One difference is that you can also pass
    a Request instance instead of URL. Raises a URLError (subclass of
    IOError); for HTTP errors, raises an HTTPError, which can also be
    treated as a valid response.
Its basic usage is the same as in the urllib library. The docstring of urlopen in urllib reads:
    urlopen(url, data=None, proxies=None)
    Create a file-like object for the specified URL to read from.
Unlike urllib, however, the first argument (url) of urlopen in urllib2 can also be a Request instance.
1. Basic usage
Example:

import urllib2

# Same usage as urlopen in urllib
response = urllib2.urlopen('http://www.baidu.com')
response.read()

# urllib2-specific usage with a Request instance
request = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(request)
response.read()

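I much prefer the second form. After all, an HTTP exchange starts with a request, and only then can there be a response. Structuring the code this way keeps the logic clear and makes it easy to read.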
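To make the request-first idea concrete, here is a minimal sketch that builds a Request and inspects it before anything is sent. It assumes nothing beyond the standard library; the alias request_mod is my own name, used so the same lines run on Python 2 (urllib2) and on Python 3, where these classes live in urllib.request.

```python
try:
    import urllib2 as request_mod          # Python 2
except ImportError:
    import urllib.request as request_mod   # Python 3 home of the same classes

# Building a Request performs no network I/O, so it can be inspected freely.
request = request_mod.Request('http://www.baidu.com')

method = request.get_method()      # 'GET', because no data was supplied
full_url = request.get_full_url()  # the URL that will be requested
print('%s %s' % (method, full_url))
```

Passing this object to urlopen is what actually sends the request.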
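2. Simulating a POST request
All the requests above are GET requests. What if we need to simulate a POST request?
Looking at help(urllib2.Request), the __init__ constructor is declared as follows:
    __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)
From the signature, the POST data goes into data, and the HTTP request headers can be set through headers.
Example: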

import urllib
import urllib2

values = {}
values['username'] = "God"
values['password'] = "XXXX"
data = urllib.urlencode(values)  # urlencode comes from the urllib library

url = "http://xxxx.xxxxx/login"
request = urllib2.Request(url, data)  # supplying data makes this a POST
response = urllib2.urlopen(request)
print response.read()

Replace the url, username and password with values for your own scenario.
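The switch from GET to POST can be checked without contacting any server. The sketch below is an illustration under assumptions: build_login_request is a hypothetical helper, example.invalid is a placeholder host, and the import fallback lets it run on Python 2 or 3.

```python
try:
    from urllib import urlencode            # Python 2
    import urllib2 as request_mod
except ImportError:
    from urllib.parse import urlencode      # Python 3
    import urllib.request as request_mod

def build_login_request(url, username, password):
    """Return an unsent POST Request carrying urlencoded form fields."""
    # A list of pairs keeps the field order stable in the encoded body.
    body = urlencode([('username', username), ('password', password)])
    # Python 3 requires bytes for data; encoding is harmless on Python 2.
    return request_mod.Request(url, body.encode('ascii'))

request = build_login_request('http://example.invalid/login', 'God', 'XXXX')
print(request.get_method())  # the method is chosen by the presence of data
```

Only when this request is handed to urlopen does the body go over the wire.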
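3. Setting HTTP request headers
Next, use the headers parameter to modify some of the HTTP request headers. This is a slight modification of the previous example: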

import urllib
import urllib2

values = {}
values['username'] = "God"
values['password'] = "XXXX"
data = urllib.urlencode(values)

url = "http://xxxx.xxxxx/login"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0',
    # the body is urlencoded form data, so declare it as such
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.baidu.com/',
}
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
print response.read()

The developer tools in your browser (F12) show more header fields that you can copy.
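One detail worth knowing when reading headers back: Request stores each header name via str.capitalize(), so 'User-Agent' is kept as 'User-agent'. A small sketch, with example.invalid as a placeholder host and request_mod as my own alias for urllib2 or urllib.request:

```python
try:
    import urllib2 as request_mod          # Python 2
except ImportError:
    import urllib.request as request_mod   # Python 3

request = request_mod.Request(
    'http://example.invalid/login',
    headers={'User-Agent': 'Mozilla/5.0', 'Referer': 'http://www.baidu.com/'},
)
# Headers can also be added after construction.
request.add_header('Accept-Language', 'zh-CN')

# Lookups must use the stored spelling: only the first letter is upper case.
user_agent = request.get_header('User-agent')
stored_names = sorted(name for name, _ in request.header_items())
print(stored_names)
```

The spelling only matters for these local lookups; HTTP header names are case-insensitive on the wire.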
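4. Setting a request timeout
Requests can hang for all sorts of reasons. Rather than waiting indefinitely, set a timeout on urlopen to abandon requests that take too long to respond.
    urlopen(url, data=None, timeout=<object object>)
One thing to note when using timeout: if you have no data to send, you must pass timeout explicitly as a keyword argument, otherwise the value would be taken as data.
Example: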
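
import urllib2

data = 'username=God'  # urlencoded POST body
# With data, timeout can be the third positional argument
urllib2.urlopen('http://www.baidu.com', data, 10)
# Without data, pass timeout by keyword
urllib2.urlopen('http://www.baidu.com', timeout=10)
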
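A timeout is only useful if the resulting exception is handled. The following is a rough sketch, not a definitive pattern: fetch_with_timeout is a hypothetical helper, and it goes through an opener with an empty ProxyHandler (my choice, so that proxy environment variables do not change the behaviour) instead of calling urlopen directly.

```python
import socket

try:
    import urllib2 as request_mod                    # Python 2
    from urllib2 import URLError
except ImportError:
    import urllib.request as request_mod             # Python 3
    from urllib.error import URLError

def fetch_with_timeout(url, seconds):
    """Return the response body, or None if the request fails or times out."""
    # An empty ProxyHandler ignores proxy environment variables, so the
    # request goes straight to the host named in the URL.
    opener = request_mod.build_opener(request_mod.ProxyHandler({}))
    try:
        return opener.open(url, timeout=seconds).read()
    except (URLError, socket.timeout):
        # A timeout may surface wrapped in URLError (while connecting) or as
        # a bare socket.timeout (while waiting for the response).
        return None
```

Calling fetch_with_timeout('http://www.baidu.com', 10) would then either return the page body or give up after roughly ten seconds of silence.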
II. opener (OpenerDirector)
    The OpenerDirector manages a collection of Handler objects that do
    all the actual work. Each Handler implements a particular protocol or
    option. The OpenerDirector is a composite object that invokes the
    Handlers needed to open the requested URL. For example, the
    HTTPHandler performs HTTP GET and POST requests and deals with
    non-error returns. The HTTPRedirectHandler automatically deals with
    HTTP 301, 302, 303 and 307 redirect errors, and the HTTPDigestAuthHandler
    deals with digest authentication.
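What is it for? It manages a collection of handler objects. As I understand it, when we call urlopen a default set of handlers is already in place; it is simply transparent to us. Those handlers are enough for plain GET/POST requests. But what if we want to do something more, such as going through a proxy, that plain GET/POST handling does not cover? Then we need to swap in different handlers, and that is what an opener is for.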
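The default handlers can be listed directly, which makes the "transparent" part visible. A minimal sketch, where request_mod is again my own alias for urllib2 (Python 2) or urllib.request (Python 3); the exact list varies by Python version and build.

```python
try:
    import urllib2 as request_mod          # Python 2
except ImportError:
    import urllib.request as request_mod   # Python 3

# With no arguments, build_opener installs the default handler set.
default_opener = request_mod.build_opener()
default_names = sorted(h.__class__.__name__ for h in default_opener.handlers)
print(default_names)

# Passing an instance of a default handler class replaces the default one
# instead of adding a second copy.
custom_opener = request_mod.build_opener(request_mod.HTTPHandler(debuglevel=1))
http_handler_count = sum(
    1 for h in custom_opener.handlers if h.__class__.__name__ == 'HTTPHandler'
)
```

Building an opener performs no network I/O, so this is safe to run anywhere.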
1. Setting a proxy

import urllib2

proxy_handler = urllib2.ProxyHandler({"http": 'http://11.11.11.11:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # urlopen now uses this opener globally
response = urllib2.urlopen('http://xxx.xxx.xxxx')
response.read()

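install_opener changes the behaviour of every later urlopen call in the process. If the proxy should apply to only some requests, one option is to keep the opener in a variable and call its open method. A sketch under that assumption; make_proxy_opener is a hypothetical helper and the proxy address is the placeholder used above.

```python
try:
    import urllib2 as request_mod          # Python 2
except ImportError:
    import urllib.request as request_mod   # Python 3

def make_proxy_opener(proxy_url):
    """Return an opener that routes http requests through proxy_url."""
    handler = request_mod.ProxyHandler({'http': proxy_url})
    return request_mod.build_opener(handler)

opener = make_proxy_opener('http://11.11.11.11:8080')

# The custom ProxyHandler replaces the default one inside the opener.
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, request_mod.ProxyHandler)
]
print(proxy_handlers[0].proxies)
```

Requests made with opener.open(url) would go through the proxy, while plain urlopen calls elsewhere stay unaffected.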
2. Enabling debug logging for http and https

import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')

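3. Handling cookies with cookielib
First, get a basic understanding of the cookielib module. It is quite powerful and worth studying carefully.
Here we only look at the parts related to opener and skip the rest of cookielib for now.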

import urllib2
import cookielib

cookie = cookielib.CookieJar()
cookieHandler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(cookieHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')

# The jar now holds whatever cookies the server set
for item in cookie:
    print 'CookieName = ' + item.name
    print 'CookieValue = ' + item.value

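Iterating a CookieJar works the same way whether the cookies came from a server or were added by hand. The sketch below fills a jar manually so it runs without network access; cookies_as_dict is a hypothetical helper, and the cookie name, value and domain are made-up placeholders.

```python
try:
    import cookielib                      # Python 2
except ImportError:
    import http.cookiejar as cookielib    # Python 3 name of the same module

def cookies_as_dict(jar):
    """Flatten a CookieJar into a {name: value} dict (domain and path are dropped)."""
    return dict((c.name, c.value) for c in jar)

jar = cookielib.CookieJar()
# A hand-built cookie standing in for one a server would normally set.
jar.set_cookie(cookielib.Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.invalid', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

snapshot = cookies_as_dict(jar)
print(snapshot)
```

The same helper could be applied to the jar passed to HTTPCookieProcessor after a real request.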
III. Exception handling: URLError and HTTPError
HTTPError is a subclass of URLError:
    URLError
        HTTPError(URLError, urllib.addinfourl)

import urllib2

req = urllib2.Request('http://www.baidu.com/mmmaa')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:  # must come before URLError, its parent class
    if hasattr(e, "code"):
        print e.code
except urllib2.URLError as e:
    if hasattr(e, "reason"):
        print e.reason
else:
    print "OK"
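The order of the except clauses matters because every HTTPError is also a URLError. The sketch below shows the same subclass-first check on exception objects built by hand, so it runs offline; describe_error is a hypothetical helper and example.invalid is a placeholder host.

```python
try:
    from urllib2 import HTTPError, URLError          # Python 2
except ImportError:
    from urllib.error import HTTPError, URLError     # Python 3

def describe_error(exc):
    """Summarise an exception raised by urlopen in one line."""
    # Test the subclass first: every HTTPError is also a URLError.
    if isinstance(exc, HTTPError):
        return 'HTTP error %d' % exc.code
    if isinstance(exc, URLError):
        return 'URL error: %s' % exc.reason
    return 'other error: %r' % (exc,)

# Hand-built exceptions standing in for what urlopen would raise.
http_case = describe_error(
    HTTPError('http://example.invalid/x', 404, 'Not Found', None, None)
)
url_case = describe_error(URLError('connection refused'))
print(http_case)
print(url_case)
```

If the two isinstance checks were swapped, the 404 case would be reported as a generic URL error and the status code would be lost.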
