Python + lxml: scraping web pages with XPath instead of regular expressions

在水一万 · Posted on 2017-5-5 11:35:48

My first introductory Python program:
Python + lxml: scrape a web page with XPath, no regular expressions needed

# -*- coding:gb2312 -*-
import urllib
import hashlib
import os


class Spider:
    '''Fetch, cache and parse an HTML page (Python 2).'''

    def get_html(self, url):
        # Download the raw page source
        sock = urllib.urlopen(url)
        htmlSource = sock.read()
        sock.close()
        return htmlSource

    def cache_html(self, filename, htmlSource):
        # Save the page source to a local file
        f = open(filename, 'w')
        f.write(htmlSource)
        f.close()

    def analysis_html(self, htmlSource):
        #from lxml import etree
        # soupparser relies on BeautifulSoup being installed
        import lxml.html.soupparser as soupparser
        dom = soupparser.fromstring(htmlSource)
        #doc = dom.parse(dom)
        # Second <a> directly under the element whose id is "lh"
        r = dom.xpath(".//*[@id='lh']/a[2]")
        print len(r)
        print r[0].tag
        # Printing the Chinese text directly with "print r[0].text" raises
        # an error, so encode('gb2312') is used, and the file encoding is
        # declared at the top of the file.
        # Reference: http://blogold.chinaunix.net/u2/60332/showart_2109290.html
        print r[0].text.encode('gb2312')
        print 'done'

    def get_cache_html(self, filename):
        # Return the cached page source, or '' if nothing is cached yet
        if not os.path.isfile(filename):
            return ''
        f = open(filename, 'r')
        content = f.read()
        f.close()
        return content


if __name__ == '__main__':
    spider = Spider()
    url = 'http://www.baidu.com'
    # The cache file name is derived from the MD5 digest of the URL
    md5_str = hashlib.md5(url).hexdigest()
    filename = "html-" + md5_str + ".html"
    htmlSource = spider.get_cache_html(filename)
    if not htmlSource:
        htmlSource = spider.get_html(url)
        spider.cache_html(filename, htmlSource)
    spider.analysis_html(htmlSource)

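Program flow:
Fetch the page: get_html
Save the page: cache_html
Parse the page: analysis_html
Helper method: get_cache_html. Once a page has been fetched it is saved as a local file, and the next run reads the HTML from that file instead of downloading it again.
XPath analysis tool: FirePath, a Firefox add-on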
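
The fetch-or-read-from-cache step described above can be sketched in Python 3 roughly as follows. This is only an illustration, not the code from the post: the names cache_path and load_or_fetch, the cache_dir parameter, and the idea of passing the downloader in as a fetch callable are all assumptions made here so that the caching logic can be tried without network access.

```python
import hashlib
import os


def cache_path(url, cache_dir="."):
    """Build the cache file name from the MD5 hex digest of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, "html-" + digest + ".html")


def load_or_fetch(url, fetch, cache_dir="."):
    """Return the page bytes, preferring a non-empty local copy.

    `fetch` is a hypothetical callable taking a URL and returning bytes;
    it is only called when no usable cache file exists.
    """
    path = cache_path(url, cache_dir)
    if os.path.isfile(path):
        with open(path, "rb") as f:
            cached = f.read()
        if cached:  # an empty file is treated as a cache miss
            return cached
    content = fetch(url)
    with open(path, "wb") as f:
        f.write(content)
    return content
```

In real use, fetch could be something like lambda u: urllib.request.urlopen(u).read(); opening the files in binary mode avoids having to guess the page encoding at this stage.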

lxml reference for further study: http://lxml.de/index.html
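
For anyone who wants to try the XPath expression without downloading anything, here is a minimal Python 3 sketch. It swaps in lxml.html.fromstring for the soupparser used in the post (so BeautifulSoup is not needed), and the SAMPLE markup and the second_link function name are made up for illustration; the real page structure may differ.

```python
import lxml.html

# Hypothetical markup standing in for the downloaded page.
SAMPLE = """
<html><body>
  <div id="lh">
    <a href="/first">first link</a>
    <a href="/second">second link</a>
  </div>
</body></html>
"""


def second_link(html_source):
    """Return the second <a> directly under the element with id="lh"."""
    dom = lxml.html.fromstring(html_source)
    matches = dom.xpath(".//*[@id='lh']/a[2]")
    # xpath() returns a list; it is empty when nothing matches
    return matches[0] if matches else None


link = second_link(SAMPLE)
print(link.tag, link.text, link.get("href"))
```

The [2] in the expression is a 1-based position, so it selects the second a child rather than the third.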
