A quick-start introduction to lxml, a Python HTML parser library
http://blog.csdn.net/marising/article/details/5821090
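lxml is a Python library that parses HTML/XML and builds a DOM. It is powerful and performs well. lxml includes interfaces to libraries such as ElementTree, html5lib, and BeautifulSoup, but it also has its own counterparts to them. This makes lxml fairly complex, and first-time users find it hard to see how the pieces relate.
1. Parse HTML and build a DOM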
>>> import lxml.etree as etree
>>> html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
>>> dom = etree.fromstring(html)
>>> etree.tostring(dom)
'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
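The session above shows tostring() returning a plain string, as it does on Python 2. On Python 3 the same call returns bytes. A minimal sketch, assuming Python 3 and the same html string as above, of getting a text string or an indented dump instead (the names as_bytes, as_text and pretty are only illustrative):

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)

# Default serialization yields bytes on Python 3.
as_bytes = etree.tostring(dom)

# encoding='unicode' asks lxml for a str instead of bytes.
as_text = etree.tostring(dom, encoding='unicode')

# pretty_print=True lets lxml add line breaks and indentation where it can.
pretty = etree.tostring(dom, encoding='unicode', pretty_print=True)
print(pretty)
```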
With the BeautifulSoup parser, it looks like this:
>>> import lxml.html.soupparser as soupparser
>>> dom = soupparser.fromstring(html)
>>> etree.tostring(dom)
'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
I strongly recommend soupparser, because it handles malformed HTML far better than etree does.
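To illustrate why a lenient parser matters, here is a minimal sketch with a deliberately broken document. It assumes Python 3, and it uses lxml.html.fromstring as the lenient parser rather than soupparser, only because soupparser additionally needs BeautifulSoup installed. The broken string and the helper try_strict are made up for this example.

```python
from lxml import etree
import lxml.html

# Hypothetical malformed input: <div> and <p> are never closed.
broken = '<html><body id="1">abc<div>123<p>unclosed</body></html>'


def try_strict(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
        return True
    except etree.XMLSyntaxError:
        return False


# The XML parser behind etree.fromstring rejects mismatched tags.
strict_ok = try_strict(broken)

# The HTML parser recovers and still builds a tree.
lenient = lxml.html.fromstring(broken)
div_tags = [el.tag for el in lenient.iter('div')]

print(strict_ok, lenient.tag, div_tags)
```

The same comparison can be repeated with soupparser.fromstring(broken) when BeautifulSoup is available.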
2. Access elements through the DOM
Number of child elements:
>>> len(dom)
1
Access a child element:
>>> dom[0].tag
'body'
Loop over the children:
>>> for child in dom:
...     print(child.tag)
...
body
Find the index of a node:
>>> body = dom[0]
>>> dom.index(body)
0
Get the parent node from a child node:
>>> body.getparent().tag
'html'
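The index and parent lookups above can be combined to describe where an element sits in the tree. A minimal sketch, assuming Python 3 and the same html string as in section 1; the helper name ancestors is hypothetical and not part of lxml:

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)


def ancestors(element):
    """Return the tags from the root down to element, following getparent()."""
    tags = []
    node = element
    while node is not None:
        tags.append(node.tag)
        node = node.getparent()  # None once the root has been passed
    return list(reversed(tags))


body = dom[0]
second_div = body[1]

path = ancestors(second_div)         # tags from the root to the element
position = body.index(second_div)    # position among the children of body

print('/'.join(path), position)
```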
Iterate over all nodes (the element itself and all of its descendants):
>>> for ele in dom.iter():
...     print(ele.tag)
...
html
body
div
div
Traverse and print all child nodes:
>>> root = body  # assumption: reuse the body element from above
>>> children = list(root)
>>> for child in root:
...     print(child.tag)
The siblings (or neighbours) of an element are accessed with the getnext() and getprevious() methods (root here is any element with at least two children, such as body above):
>>> root[0] is root[1].getprevious() # lxml.etree only!
True
>>> root[1] is root[0].getnext() # lxml.etree only!
True
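Sibling access also gives another way to walk the children of a node. A minimal sketch, assuming Python 3 and the same html string as in section 1; the helper name sibling_chain is hypothetical:

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)
body = dom[0]


def sibling_chain(first):
    """Collect first and every following sibling by calling getnext()."""
    chain = []
    node = first
    while node is not None:
        chain.append(node)
        node = node.getnext()  # None after the last sibling
    return chain


divs = sibling_chain(body[0])
texts = [d.text for d in divs]    # text inside each div
tails = [d.tail for d in divs]    # text that follows each div inside body

print(texts, tails)
```

In lxml.etree the text that follows an element is stored in its tail attribute, which is why 'def' and 'ghi' belong to the div elements rather than to body.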
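body.text_content() returns all the text contained in this node, including the text of its descendants; for body that is effectively all the text in the document. Note that text_content() is a method of lxml.html elements; elements created by lxml.etree do not have it.
5. XPath support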
All div elements:
>>> for ele in dom.xpath('//div'):
...     print(ele.tag)
...
div
div
The element with id="1":
>>> dom.xpath('//*[@id="1"]')[0].tag
'body'
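XPath expressions can also return text and attribute values rather than elements. A minimal sketch, assuming Python 3 and the same html string as in section 1; the variable names are only illustrative:

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)

# text() selects text nodes: here, the text inside each div.
div_texts = dom.xpath('//div/text()')

# Text nodes that are direct children of body (its text and the divs' tails).
body_texts = dom.xpath('//body/text()')

# string() concatenates all text below the first matching node.
all_text = dom.xpath('string(//body)')

# @id selects attribute values.
ids = dom.xpath('//body/@id')

print(div_texts, body_texts, all_text, ids)
```

The string() form gives a result comparable to text_content(), and it also works on trees built with lxml.etree.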