BeautifulSoup: an elegant tool for Python web crawlers

ggttt · Posted on 2015-10-10 08:41:17

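I have recently been studying network programming in Python, so writing a crawler was a natural next step. At first I struggled with matching content using regular expressions; later I discovered BeautifulSoup, which handles these tasks very well:
Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save a great deal of programming time.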
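To make the point about malformed markup and tree modification concrete, here is a minimal sketch. It is written for Python 3 with the bs4 package installed; the broken snippet and the choice of the built-in "html.parser" are my own assumptions, not something this post specifies.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <p> nor <b> is ever closed.
broken = "<p>Unclosed <b>bold text"

# "html.parser" is Python's built-in parser (an assumption; other parsers
# may repair the same markup slightly differently).
soup = BeautifulSoup(broken, "html.parser")

# The tree is still navigable even though the input was incomplete.
print(soup.b.get_text())

# Modifying the tree: replace a tag's text and add an attribute.
soup.b.string = "fixed"
soup.p["class"] = "note"

# Serializing the tree emits the closing tags the input was missing.
print(str(soup))
```

The parser closes the dangling tags when it reaches the end of the input, which is why the serialized result is well-formed even though the source was not.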
Basic usage:

>>> from bs4 import BeautifulSoup
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>> soup = BeautifulSoup(html_doc)
>>> soup.head()
[<title>The Dormouse's story</title>]
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.string
u"The Dormouse's story"
>>> soup.body.b
<b>The Dormouse's story</b>
>>> soup.body.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.get_text()
u"... The Dormouse's story\n... \n... The Dormouse's story\n... \n... Once upon a time there were three little sisters; and their names were\n... Elsie,\n... Lacie and\n... Tillie;\n... and they lived at the bottom of a well.\n... \n... ...\n... "
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> for key in soup.find_all('a'):
...     print key.get('class'), key.get('href')
...
['sister'] http://example.com/elsie
['sister'] http://example.com/lacie
['sister'] http://example.com/tillie

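### These methods make it quick to pull out the elements and results you need:
Brief notes:
soup.body: returns the content under the body tag; tags can also be chained with a dot, for example soup.body.a.
soup.title.string: returns the text content of the title tag.
soup.get_text(): returns all the text in the document.
soup.find_all(): accepts flexible combinations of criteria; you can search by any tag name as well as by attributes such as class and id.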
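The different ways of calling find_all() mentioned above can be sketched as follows. This is an illustration for Python 3; the document is a trimmed copy of the three sample links, and "html.parser" is my choice of parser.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the three links from the sample document.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# By tag name.
all_links = soup.find_all("a")

# By CSS class; the keyword is class_ because class is reserved in Python.
sisters = soup.find_all("a", class_="sister")

# By id; find() returns the first match, or None if nothing matches.
lacie = soup.find(id="link2")

# By tag name combined with an attribute dictionary.
tillie = soup.find_all("a", attrs={"id": "link3"})

# Several tag names at once.
mixed = soup.find_all(["p", "a"])

print(len(all_links), len(sisters), lacie.get_text(), tillie[0]["href"], len(mixed))
```

The attrs dictionary form is the one the crawler script in this post uses; class_ is simply a shorter spelling for the common case of filtering on class.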
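A worked example, using the news on the live-sports schedule site I read often:
1. First, look at the content we want to fetch:
[Screenshot: QQ截图20151010084107.png]

What I want to fetch is the bar of hot news headlines at the top, for example "世预赛：国足0-1不敌卡塔尔" (World Cup qualifier: China loses 0-1 to Qatar).
2. Inspect the page source: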

<div class="fb_bbs">
<a href="http://news.zhibo8.cc/zuqiu/" style="padding: 0 5px 0 0;" target="_blank" title="足球新闻"><img src="/css/images/football.png"/></a>
<a href="http://news.zhibo8.cc/zuqiu/" target="_blank"> 世预赛：国足0-1不敌卡塔尔</a>|
<a href="http://news.zhibo8.cc/zuqiu/2015-10-09/5616a910d74ac.htm" target="_blank">国足“刷卡”耻辱：11年不胜</a>|
<a href="http://news.zhibo8.cc/zuqiu/2015-10-09/5616b22cbd134.htm" target="_blank">切尔西签下阿梅利亚</a>|
<a href="http://news.zhibo8.cc/zuqiu/2015-10-09/5616daa45ee48.htm" target="_blank">惊人！莱万5场14球</a>|
<a href="http://tu.zhibo8.cc/zuqiu/" target="_blank">图-FIFA16中国球员</a>
</div>

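### From the source we can see that this section is wrapped in a div tag with class="fb_bbs"; of course, we need to make sure that this class is unique on the page.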
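One way to check whether the class really is unique is to count the matching blocks before relying on them. The sketch below is for Python 3 and runs against a hypothetical, shortened page that I made up for illustration; on the real site you would parse the downloaded HTML instead.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened page: two blocks happen to share the class "fb_bbs".
page = """
<div class="fb_bbs"><a href="/a">headline A</a>|<a href="/b">headline B</a></div>
<div class="other">unrelated</div>
<div class="fb_bbs"><a href="/c">headline C</a></div>
"""
soup = BeautifulSoup(page, "html.parser")

# find_all returns every match, so its length tells us whether the class is unique.
blocks = soup.find_all("div", class_="fb_bbs")
if len(blocks) == 1:
    print("class fb_bbs is unique on this page")
else:
    print("class fb_bbs matches {} blocks".format(len(blocks)))
```

If more than one block matches, looping over all of them (as find_all naturally allows) collects the headlines from each.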
3. The BeautifulSoup code that extracts the result:

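#coding=utf-8
import sys
import urllib2
from bs4 import BeautifulSoup

try:
    html = urllib2.urlopen("http://www.zhibo8.cc")
except urllib2.HTTPError as err:
    # Without a response there is nothing to parse, so report the error and stop.
    print str(err)
    sys.exit(1)

soup = BeautifulSoup(html)
# Each matching div holds the headlines as one string separated by "|".
for i in soup.find_all("div", attrs={"class": "fb_bbs"}):
    result = i.get_text().split("|")
    for term in result:
        print term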

4. Result of running the script:

[iyunv@master network]# python url.py
世预赛：国足0-1不敌卡塔尔
国足“刷卡”耻辱：11年不胜
切尔西签下阿梅利亚
惊人！莱万5场14球
图-FIFA16中国球员
利物浦官方宣布克洛普上任
档案：克洛普的安菲尔德之旅
欧预赛-德国爆冷0-1爱尔兰
葡萄牙1-0胜丹麦
图-穆帅难罢手
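Splitting get_text() on "|" depends on the separator the page happens to use and throws away the link targets. As a variant (my own sketch for Python 3, not what the script above does), one could walk the a tags inside each fb_bbs block instead. The snippet is a shortened copy of the block with placeholder headline text, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fb_bbs block; headline text replaced by placeholders.
snippet = """
<div class="fb_bbs"><a href="http://news.zhibo8.cc/zuqiu/" title="news"><img src="/css/images/football.png"/></a><a href="http://news.zhibo8.cc/zuqiu/"> headline one</a>|<a href="http://tu.zhibo8.cc/zuqiu/">headline two</a></div>
"""
soup = BeautifulSoup(snippet, "html.parser")

headlines = []
for block in soup.find_all("div", class_="fb_bbs"):
    for link in block.find_all("a"):
        # strip=True trims the surrounding whitespace from the link text.
        text = link.get_text(strip=True)
        if not text:
            # Skip links that only wrap an icon image.
            continue
        headlines.append((text, link.get("href")))

for text, href in headlines:
    print("{} -> {}".format(text, href))
```

This keeps each headline paired with its URL, which is usually what a crawler needs for the next request.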

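With that, the task is essentially done. The code is much shorter than the equivalent written with the re module, and it is clean and elegant; it really is a handy tool for writing crawlers in Python.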
