Three Ways to Crawl Web Pages with Python

Posted by seemebaby on 2018-8-16 12:42:53

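　　# Method 1 of 3: read the page's charset with getparam() from the urllib or urllib2 module (Python 2)
　　import urllib
　　import urllib2
　　# info() returns the response headers; getparam('charset') reads the charset parameter of Content-Type
　　fopen1 = urllib.urlopen('http://www.baidu.com').info()
　　fopen2 = urllib2.urlopen('http://www.sina.com').info()
　　print fopen1.getparam('charset')  # None if the server declares no charset
　　print fopen2.getparam('charset')
　　#---- Some sites use anti-crawler measures; for those, send a browser-like User-Agent as follows ----
　　url = 'http://www.qiushibaike.com/hot/page/1'
　　user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
　　headers = {'User-Agent': user_agent}
　　request = urllib2.Request(url, headers=headers)
　　c_res = urllib2.urlopen(request).info()
　　print c_res.getparam('charset')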
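For readers on Python 3, where urllib2 has been merged into urllib.request and the header object no longer has getparam(), the same header lookup might look like the sketch below. The helper names charset_from_content_type and fetch_declared_charset are made up for illustration, and the 10-second timeout is an assumption rather than anything from the original post.

```python
import urllib.request
from email.message import Message

# Same User-Agent string as in the post; any browser-like value would do.
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def charset_from_content_type(content_type):
    """Return the lowercased charset of a Content-Type value, or None if absent."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def fetch_declared_charset(url, timeout=10):
    """Request url with a browser-like User-Agent and return the declared charset.

    Hypothetical helper: performs a real network request when called.
    """
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # response.headers is an email.message.Message subclass,
        # so get_content_charset() is available on it directly.
        return response.headers.get_content_charset()
```

Calling fetch_declared_charset('http://www.qiushibaike.com/hot/page/1') would play the role of the last three lines of method 1. As with getparam(), the result is None when the server sends no charset.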
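　　# Method 2 of 3: use the chardet module (it seems a little slower than method 1)
　　import chardet
　　import urllib
　　import urllib2
　　# First fetch the page content
　　data1 = urllib.urlopen('http://www.baidu.com').read()
　　# Then let chardet analyze the content
　　chardit1 = chardet.detect(data1)
　　print chardit1['encoding']
　　#---- Some sites use anti-crawler measures; for those, send a browser-like User-Agent as follows ----
　　url = 'http://www.qiushibaike.com/hot/page/1'
　　user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
　　headers = {'User-Agent': user_agent}
　　request = urllib2.Request(url, headers=headers)  # build the request with the custom headers
　　response = urllib2.urlopen(request).read()
　　chardit1 = chardet.detect(response)
　　print chardit1['encoding']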
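Methods 1 and 2 complement each other: the header lookup is cheap but yields nothing when the server omits the charset, while content detection needs the whole body. One possible way to combine them is sketched below. The function choose_encoding, its parameters, and the 'utf-8' default are all assumptions made for illustration. The detector is passed in as a callable so the sketch itself has no third-party dependency.

```python
def choose_encoding(declared, body, detect=None, default='utf-8'):
    """Pick an encoding name for a downloaded page.

    declared: charset from the response headers, or None
    body:     raw bytes of the page
    detect:   optional callable returning a dict with an 'encoding' key,
              for example chardet.detect
    default:  assumed fallback when nothing else is known
    """
    # 1. Trust the server's declaration when there is one.
    if declared:
        return declared.lower()
    # 2. Otherwise ask the content detector, if one was supplied.
    if detect is not None:
        guess = detect(body).get('encoding')
        if guess:
            return guess.lower()
    # 3. Fall back to the assumed default.
    return default
```

With chardet installed, choose_encoding(header_charset, page_bytes, chardet.detect) would consult chardet only when the header gave no answer; header_charset and page_bytes are placeholder names here.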
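　　# Method 3 of 3: use the BeautifulSoup module
　　from bs4 import BeautifulSoup
　　import urllib2
　　content = urllib2.urlopen('http://www.baidu.com').read()
　　soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly to avoid a warning
　　print soup.original_encoding  # this output is the page's encoding
　　#---- Some sites use anti-crawler measures; handle them as in the two methods above ----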
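One signal a detector of this kind can draw on is the encoding a page declares about itself in a meta tag. The sketch below is a rough, dependency-free illustration of that single idea. It is a simplified stand-in written for this post, not BeautifulSoup's actual code, and the name sniff_meta_charset and the 2048-byte scan limit are assumptions.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(html_bytes, scan_limit=2048):
    """Return the lowercased charset declared in a meta tag, or None.

    Only the first scan_limit bytes are searched, since the declaration
    normally sits near the top of the document.
    """
    match = _META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()
```

A regex like this ignores comments, scripts, and malformed markup, so it is best treated as a hint rather than a replacement for soup.original_encoding.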
