BeautifulSoup: a third-party Python library for parsing HTML

Posted by jiang1799 on 2015-12-2 15:00:14

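Background
  When writing crawlers or parsing web pages in Python, for example in:
  How to use Python, C#, and other languages to scrape static pages, scrape dynamic pages, and simulate logging in to websites
  you frequently need to parse HTML and similar markup.
  For extracting content from simple HTML, Python's built-in regular expression module, re, is sufficient.
  For complex HTML, however, and especially for invalid or buggy HTML, it is better to use a dedicated HTML parsing library.
  Among Python's dedicated HTML parsing libraries, one of the easier ones to use is BeautifulSoup.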
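  As a rough illustration of the "simple HTML" case, the sketch below pulls one value out of a short snippet using only the re module. The snippet and the names PATTERN, user_name, and variant are assumptions made for this example. It also shows why the approach is fragile: equivalent markup written slightly differently no longer matches.

```python
import re

# Hypothetical sample input, modelled on the demo HTML used later in this post.
html = '<div class="icon_col"><h1 class="h1user">crifan</h1></div>'

# Non-greedy capture of the text between the opening and closing h1 tags.
PATTERN = r'<h1 class="h1user">(.*?)</h1>'

match = re.search(PATTERN, html)
user_name = match.group(1) if match else None
print(user_name)  # crifan

# Equivalent markup with single quotes and an extra attribute: the same
# pattern no longer matches, which is where a real HTML parser helps.
variant = "<h1 id='u' class='h1user'>crifan</h1>"
variant_match = re.search(PATTERN, variant)
print(variant_match)  # None
```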
　　
Introduction to BeautifulSoup
  A Python library dedicated to parsing HTML/XML.
  Its main features:
  It can parse even buggy, malformed HTML.
  It is very powerful.

  The BeautifulSoup home page is:
  http://www.crummy.com/software/BeautifulSoup/
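  To make the claim about broken HTML concrete, here is a minimal sketch. It assumes Python 3 with the beautifulsoup4 package installed and uses Python's built-in "html.parser" backend; the broken snippet itself is made up for the example. Other parser backends may repair the same markup differently.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Deliberately broken markup: <p> and <b> are never closed.
broken_html = "<div><p>unclosed paragraph<b>bold text</div>"

# "html.parser" is the parser that ships with Python itself.
soup = BeautifulSoup(broken_html, "html.parser")

# Both tags are still reachable in the parsed tree.
bold_text = soup.find("b").get_text()
para_text = soup.find("p").get_text()
print(bold_text)  # bold text
print(para_text)
```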
　　
BeautifulSoup versions
  BeautifulSoup has two major versions:

BeautifulSoup 3
  The earlier release line is 3.x.
Online documentation for BeautifulSoup 3
  The latest available online documentation is:
  http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
  The Chinese version is:
  http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html

Downloading BeautifulSoup 3
  Many versions can be downloaded from:
  http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/
  for example 3.0.6, the version I usually use:
  BeautifulSoup-3.0.6.py
  http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/BeautifulSoup-3.0.6.py
　　
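BeautifulSoup 4: abbreviated as bs4
  In the newest version, v4, the BeautifulSoup module has been renamed to bs4.

  Note:
  With bs4, BeautifulSoup is imported like this:

from bs4 import BeautifulSoup

  After that, BeautifulSoup can be used directly, just as in 3.x.
  For details, see:
  [Solved] In Python 3, bs4 (BeautifulSoup 4) is already installed, but an error still occurs: ImportError: No module named BeautifulSoup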
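  Code that has to run against either version can try the bs4 import first and fall back to the old module name. The sketch below is one possible way to do that; load_beautifulsoup and soup_class are hypothetical names introduced for this example, not part of either library.

```python
def load_beautifulsoup():
    """Return the BeautifulSoup class from whichever version is installed, or None."""
    try:
        from bs4 import BeautifulSoup  # BeautifulSoup 4
        return BeautifulSoup
    except ImportError:
        pass
    try:
        from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)
        return BeautifulSoup
    except ImportError:
        return None


soup_class = load_beautifulsoup()
if soup_class is None:
    print("BeautifulSoup is not installed")
else:
    print("Loaded BeautifulSoup from module:", soup_class.__module__)
```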

Online documentation for bs4
  http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Downloading bs4
  http://www.crummy.com/software/BeautifulSoup/bs4/download/
  offers the available bs4 releases; the latest version at the time of writing is:
  beautifulsoup4-4.1.3.tar.gz
  http://www.crummy.com/software/BeautifulSoup/bs4/download/beautifulsoup4-4.1.3.tar.gz
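Using BeautifulSoup
How to install BeautifulSoup
3.0.6 and earlier: no installation needed; just place the file in the same directory as your Python file
  Versions up to 3.0.6 need no installation, which makes them the simplest to use. Download the version you want, for example:
  http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/BeautifulSoup-3.0.6.py
  which gives you BeautifulSoup-3.0.6.py, and rename it to BeautifulSoup.py.
  Then put it in the same directory as your current Python file. For example, my current Python file is:
  D:\tmp\tmp_dev_root\python\beautifulsoup_demo\beautifulsoup_demo.py
  so the file goes into
  D:\tmp\tmp_dev_root\python\beautifulsoup_demo\
  next to beautifulsoup_demo.py.

After 3.0.6: BeautifulSoup must be installed before use
  To install a third-party Python module, in short, change into the module's directory and run:

python setup.py install

  For a detailed explanation, see:
  [Summary] How to install third-party libraries and packages in Python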
　　
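How to use BeautifulSoup
  Simply import it in your Python file, here beautifulsoup_demo.py.

  For sample HTML, you can use, for example, the one from:
  [Tutorial] Scraping a web page and extracting the required information: Python version

  Related reference documentation:
  For the 3.x versions:
  find(name, attrs, recursive, text, **kwargs)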
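  The sketch below exercises each of these parameters once. It assumes Python 3 with bs4 rather than 3.x, where the call has the same shape except that recent bs4 releases name the text filter string instead of text. The HTML and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

html = """
<html><body>
<div class="icon_col"><h1 class="h1user">crifan</h1></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# name + attrs: tag name plus a dict of attribute values to match.
by_attrs = soup.find(name="h1", attrs={"class": "h1user"})

# **kwargs: attribute filters as keywords; class_ is used because
# "class" is a reserved word in Python.
by_kwarg = soup.find("h1", class_="h1user")

# recursive=False: look only at direct children. The h1 sits inside the
# div, not directly under body, so nothing is found.
direct_child = soup.body.find("h1", recursive=False)

# string (called text in 3.x): match a text node instead of a tag.
by_text = soup.find(string="crifan")

print(by_attrs)      # <h1 class="h1user">crifan</h1>
print(direct_child)  # None
print(by_text)       # crifan
```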
　　
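Extracting a piece of content from HTML with BeautifulSoup
  The simplest, most basic use, extracting a piece of content from HTML, comes down to calling the find function.
  The complete code is: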
　　

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Tutorial] BeautifulSoup: a third-party Python library for parsing HTML

http://www.crifan.com/python_third_party_lib_html_parser_beautifulsoup

Author: Crifan Li
Version: 2012-12-26
Contact: admin at crifan dot com
"""

# BeautifulSoup 3 import, Python 2 only (with bs4: from bs4 import BeautifulSoup)
from BeautifulSoup import BeautifulSoup

def beautifulsoupDemo():
    demoHtml = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""
    soup = BeautifulSoup(demoHtml)
    print "type(soup)=", type(soup)  # type(soup)= <type 'instance'>
    print "soup=", soup

    # 1. extract content
    # method 1: positional arguments, no parameter names
    #h1userSoup = soup.find("h1", {"class": "h1user"})
    # method 2: named parameters
    h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
    # more can be found at:
    # http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#find%28name,%20attrs,%20recursive,%20text,%20**kwargs%29
    print "h1userSoup=", h1userSoup  # h1userSoup= <h1 class="h1user">crifan</h1>
    # .string is the text inside the tag, as a unicode string
    h1userUnicodeStr = h1userSoup.string
    print "h1userUnicodeStr=", h1userUnicodeStr  # h1userUnicodeStr= crifan

if __name__ == "__main__":
    beautifulsoupDemo()

  The output is:

D:\tmp\tmp_dev_root\python\beautifulsoup_demo>beautifulsoup_demo.py
type(soup)= <type 'instance'>
soup=
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>

h1userSoup= <h1 class="h1user">crifan</h1>
h1userUnicodeStr= crifan　　
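  The .string attribute used in the demo above is convenient only when a tag holds a single piece of text. The following sketch, assuming Python 3 with bs4, shows what it returns in that case and in the case of mixed content; the extra p element is a hypothetical addition to the demo HTML.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

soup = BeautifulSoup(
    '<div class="icon_col"><h1 class="h1user">crifan</h1>'
    "<p>Hi <b>there</b></p></div>",
    "html.parser",
)

# A tag with exactly one text child: .string is that text node.
h1 = soup.find("h1", class_="h1user")
h1_string_type = type(h1.string).__name__
print(h1_string_type)  # NavigableString
print(h1.string)       # crifan

# A tag with mixed children (text plus a nested tag): .string is None,
# while get_text() concatenates all the text below the tag.
p = soup.find("p")
print(p.string)      # None
print(p.get_text())  # Hi there
```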
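Modifying/changing/replacing content in the original HTML with BeautifulSoup
  To change a value in the original HTML, see the explanation on the official site:
  Modifying attribute values
  It later turned out that, with the BeautifulSoup 3 setup used here, only the values of a Tag's attributes can be changed this way; the value (content) of the Tag itself cannot.

  The complete example code is: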

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Tutorial] BeautifulSoup: a third-party Python library for parsing HTML

http://www.crifan.com/python_third_party_lib_html_parser_beautifulsoup

Author: Crifan Li
Version: 2013-02-01
Contact: admin at crifan dot com
"""

# BeautifulSoup 3 import, Python 2 only (with bs4: from bs4 import BeautifulSoup)
from BeautifulSoup import BeautifulSoup

def beautifulsoupDemo():
    demoHtml = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""
    soup = BeautifulSoup(demoHtml)
    print "type(soup)=", type(soup)  # type(soup)= <type 'instance'>
    print "soup=", soup

    print '{0:=^80}'.format(" 1. extract content ")
    # method 1: positional arguments, no parameter names
    #h1userSoup = soup.find("h1", {"class": "h1user"})
    # method 2: named parameters
    h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
    # more can be found at:
    # http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#find%28name,%20attrs,%20recursive,%20text,%20**kwargs%29
    print "h1userSoup=", h1userSoup  # h1userSoup= <h1 class="h1user">crifan</h1>
    h1userUnicodeStr = h1userSoup.string
    print "h1userUnicodeStr=", h1userUnicodeStr  # h1userUnicodeStr= crifan

    print '{0:=^80}'.format(" 2. demo change tag value and property ")
    print '{0:-^80}'.format(" 2.1 can NOT change tag value ")
    print "old tag value=", soup.body.div.h1.string  # old tag value= crifan
    changedToString = u"CrifanLi"
    soup.body.div.h1.string = changedToString
    # the assignment is visible through .string ...
    print "changed tag value=", soup.body.div.h1.string  # changed tag value= CrifanLi
    # ... but the tag itself still prints with its old content
    print "After changed tag value, new h1=", soup.body.div.h1  # After changed tag value, new h1= <h1 class="h1user">crifan</h1>

    print '{0:-^80}'.format(" 2.2 can change tag property ")
    soup.body.div.h1['class'] = "newH1User"
    print "changed tag property value=", soup.body.div.h1  # changed tag property value= <h1 class="newH1User">crifan</h1>

if __name__ == "__main__":
    beautifulsoupDemo()
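  For comparison, here is a minimal sketch of the same two changes under Python 3 with bs4, which is an assumption and not the setup used in the demo above. According to the bs4 documentation, assigning to .string there replaces the tag's contents, so both the text and the attribute change show up in the printed tag.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

soup = BeautifulSoup(
    '<div class="icon_col"><h1 class="h1user">crifan</h1></div>',
    "html.parser",
)
h1 = soup.find("h1")

# Change the tag's text: in bs4 this replaces the tag's contents.
h1.string = "CrifanLi"

# Change an attribute value, exactly as in the 3.x demo.
h1["class"] = "newH1User"

print(h1)  # <h1 class="newH1User">CrifanLi</h1>
```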
　　
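Summary
  More usage notes and lessons learned have been partly collected in:
  [Summary] Lessons learned from using the third-party Python library BeautifulSoup
  [Collected] Things to note when using the Python HTML processing library BeautifulSoup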
