Parsing HTML with Python

北风留影 posted on 2015-4-24 07:08:07

　　http://blog.iyunv.com/adrianfeng/article/details/5881850
　　
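Python 2's standard library has three modules that can parse HTML text: HTMLParser, sgmllib, and htmllib. Their implementations differ, but they offer much the same functionality. The HTML-parsing classes in these modules are all base classes that do no real work themselves. When they find an element (a tag, comment, declaration, and so on), they call the corresponding method. You must override these methods, because the base-class versions do nothing.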
　　
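For example:
"""<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
<i>Be strict in what you <b>send</b>.</i></a></p>
</body></html>
"""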
　　
If this data is parsed with HTMLParser, then as soon as the <html> tag is detected, the handle_starttag method is called.
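The callback model can be seen in a minimal sketch. It assumes Python 3, where the HTMLParser module is named html.parser; the EventLogger class and the sample input are hypothetical and only serve to show which handler fires for each kind of element.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler fired for each element."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))
    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))
    def handle_decl(self, decl):
        self.events.append(("decl", decl))

logger = EventLogger()
logger.feed("<!DOCTYPE html><html><!-- note --><p>Hi</p></html>")
logger.close()
for event in logger.events:
    print(event)
```

Any handler that is not overridden simply does nothing, which is why a bare HTMLParser instance produces no output.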
　　
　　
The libraries are described in more detail below.
1. HTMLParser

#------------------ HTMLParser_stack.py ------------------#
# -*- coding: GBK -*-
import HTMLParser, sys

html = """<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
<i>Be strict in what you <b>send</b>.</i></a></p>
</body></html>
"""

tagstack = []   # currently open tags, outermost first

class ShowStructure(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs): tagstack.append(tag)
    def handle_endtag(self, tag): tagstack.pop()
    def handle_data(self, data):
        # Print the path of open tags in front of each non-blank text run
        if data.strip():
            for tag in tagstack: sys.stdout.write('/'+tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())

ShowStructure().feed(html)
　　

Output of this script:
/html/head/title >> Advice
/html/body/p >> The
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> Be strict in what you
/html/body/p/a/i/b >> send
/html/body/p/a/i >> .

Some web pages do not have strictly matched start and end tags. In that case we can ignore some of the tags, writing our own stack class to handle them.

#*--------------- TagStack class example -----------------#
import HTMLParser

class TagStack:
    def __init__(self, lst=None):
        # Avoid sharing one mutable default list between instances
        self.lst = lst if lst is not None else []
    def __getitem__(self, pos): return self.lst[pos]
    def append(self, tag):
        # Remove every paragraph-level tag if this is one
        if tag.lower() in ('p','blockquote'):
            self.lst = [t for t in self.lst
                        if t not in ('p','blockquote')]
        self.lst.append(tag)
    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            self.lst.reverse()   # restore the order before failing
            raise HTMLParser.HTMLParseError("Tag not on stack")
        del self.lst[pos]
        self.lst.reverse()

# With this class, handle_endtag must call tagstack.pop(tag)
tagstack = TagStack()
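As a rough illustration of how such a stack behaves on a page with an unclosed tag, here is a sketch for Python 3 (html.parser). The sample input, the path() helper, and the choice to silently ignore stray end tags instead of raising are assumptions made for this example, not part of the code above.

```python
from html.parser import HTMLParser

class TagStack:
    """Tag stack that tolerates unclosed paragraph-level tags."""
    PARAGRAPH_TAGS = ("p", "blockquote")

    def __init__(self):
        self.lst = []

    def append(self, tag):
        # A new paragraph-level tag implicitly closes earlier ones.
        if tag.lower() in self.PARAGRAPH_TAGS:
            self.lst = [t for t in self.lst if t not in self.PARAGRAPH_TAGS]
        self.lst.append(tag)

    def pop(self, tag):
        # Remove the nearest matching open tag; ignore stray end tags
        # (assumption: the version above raises an error instead).
        for pos in range(len(self.lst) - 1, -1, -1):
            if self.lst[pos] == tag:
                del self.lst[pos]
                return

    def path(self):
        # Hypothetical helper: render the open tags as /a/b/c
        return "".join("/" + t for t in self.lst)

class ShowStructure(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tagstack = TagStack()
        self.lines = []
    def handle_starttag(self, tag, attrs):
        self.tagstack.append(tag)
    def handle_endtag(self, tag):
        self.tagstack.pop(tag)
    def handle_data(self, data):
        if data.strip():
            self.lines.append("%s >> %s" % (self.tagstack.path(), data.strip()))

parser = ShowStructure()
# Hypothetical input: the first <p> is never closed.
parser.feed("<html><body><p>first<p>second <b>bold</b></p></body></html>")
parser.close()
print("\n".join(parser.lines))
```

With a plain list as the stack, the second paragraph would be reported under /html/body/p/p; here both paragraphs appear under /html/body/p.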
　　

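HTMLParser has a bug: it cannot handle Chinese text in attributes. For example, if a page contains a tag with an unquoted Chinese attribute value, parsing fails when it reaches that line.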
　　

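The cause, once again, is a regular expression:

attrfind = re.compile(
    r'\s*([-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

attrfind does not match Chinese characters. You can change this pattern to fix the error. sgmllib does not have this problem.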
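To make the failure concrete, the following sketch compares the pattern quoted above with a relaxed variant on an unquoted Chinese attribute value. The name attrfind_relaxed and its pattern are assumptions for illustration (one possible fix, not the one the author necessarily used), and the sample string is made up. It runs on Python 3 with only the re module.

```python
import re

# Pattern as quoted above (from the Python 2 HTMLParser module).
attrfind = re.compile(
    r'\s*([-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical relaxed variant: an unquoted value is any run of
# characters other than whitespace, quotes, or '>'.
attrfind_relaxed = re.compile(
    r'\s*([-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s>\'"]*))?')

sample = ' name=中文'

strict_value = attrfind.match(sample).group(3)
relaxed_value = attrfind_relaxed.match(sample).group(3)

print(repr(strict_value))    # '' : the value is not recognized
print(repr(relaxed_value))   # '中文'
```

Quoted values already pass through the original pattern unchanged, so only the unquoted branch needs loosening.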

2. sgmllib

HTML is a subset of SGML, so sgmllib can handle more than HTMLParser can. The code below shows how sgmllib is used.
　　

#------------------ sgmllib version of HTMLParser_stack.py ------------------#
# -*- coding: GBK -*-
import sgmllib, sys, os

html = """<lala><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
<i>Be strict in what you <b>send</b>.</i></a></p>
<form>
<input type="text"> 我 <br>
</form></body></lala>
"""

# The author's local test file; its contents are read but not used below
os.chdir('d:\\python')
f = file('testboard.txt', 'r')
contest = f.read()

tagstack = []

class ShowStructure(sgmllib.SGMLParser):
    # Called only for tags that have a start_<tag> method
    def handle_starttag(self, tag, method, attrs): tagstack.append(tag)
    # Called only for tags that have an end_<tag> method
    def handle_endtag(self, tag, method): tagstack.pop()
    def handle_data(self, data):
        if data.strip():
            for tag in tagstack: sys.stdout.write('/'+tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())

    def unknown_starttag(self, tag, attrs):
        print 'start tag:'
    def unknown_endtag(self, tag):
        print 'end tag:'
    def start_lala(self, attr):
        print 'lala tag found'

ShowStructure().feed(html)
　　
Output:

start tag:
start tag:
/lala >> Advice
end tag:
end tag:
start tag:
start tag:
/lala >> The
start tag:
/lala >> IETF admonishes:
start tag:
/lala >> Be strict in what you
start tag:
/lala >> send
end tag:
/lala >> .
end tag:
end tag:
end tag:
start tag:
start tag:
/lala >> 我
start tag:
end tag:
end tag:
end tag:

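As with HTMLParser, to parse HTML with sgmllib you subclass sgmllib.SGMLParser. The handler methods in this class are empty, and you override the ones you need. What the class provides is a call to the appropriate method in each situation.
For example, when the <html> tag is found and no start_html(self, attr) method is defined, unknown_starttag is called. How to handle it is up to you.
SGML tags can be user-defined: if you define a start_lala method, the <lala> tag is treated as a known tag and handled.

One point needs explaining. If you define both a start_tagname method and handle_starttag, only handle_starttag runs, because the default handle_starttag is what calls start_tagname. In that case start_tagname can even be an empty method. If you do not define handle_starttag, start_tagname runs when the tag is encountered. If no start method is defined for the tag name, the tag is treated as unknown and unknown_starttag is called.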
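The dispatch order just described can be imitated in a few lines. sgmllib was removed in Python 3, so the sketch below does not import it: MiniDispatcher and dispatch_start are hypothetical names, the class does no parsing at all, and it only mirrors the lookup order (start_<tag>, then handle_starttag, else unknown_starttag) as described in the text.

```python
class MiniDispatcher:
    """Hypothetical, simplified imitation of the dispatch order described
    above; it is not sgmllib and does no parsing."""

    def __init__(self):
        self.log = []

    def dispatch_start(self, tag, attrs):
        method = getattr(self, "start_" + tag, None)
        if method is None:
            # No start_<tag> method: the tag is unknown.
            self.unknown_starttag(tag, attrs)
        else:
            # A start_<tag> method exists: hand it to handle_starttag,
            # which decides whether to actually call it.
            self.handle_starttag(tag, method, attrs)

    def handle_starttag(self, tag, method, attrs):
        method(attrs)          # default behaviour: just run start_<tag>

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown:" + tag)


class Default(MiniDispatcher):
    def start_lala(self, attrs):
        self.log.append("start_lala")


class Overriding(Default):
    def handle_starttag(self, tag, method, attrs):
        # Overrides the default, so start_lala is never run.
        self.log.append("handle_starttag:" + tag)


results = {}
for cls in (Default, Overriding):
    d = cls()
    d.dispatch_start("lala", [])
    d.dispatch_start("html", [])
    results[cls.__name__] = d.log
    print(cls.__name__, d.log)
```

This matches the sgmllib example above: its handle_starttag only appends to tagstack, so 'lala tag found' is never printed.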
