Fetching Zhihu Daily (知乎日报) with Python and saving it as txt files

adminlng · Posted 2015-12-2 13:26:31

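　　Preface
　　I wrote this for practice. It is fairly simple (and has a bug); comments are welcome.
　　What it does
　　It scrapes the current day's Zhihu Daily and saves each article as a separate txt file, all collected in one folder named after the current date.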
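　　The folder step can be sketched as follows. This is only a minimal sketch, assuming the folder name is the date in YYYY-MM-DD form; the helper name make_daily_dir and its base parameter are hypothetical and not part of the original script.

```python
import datetime
import os


def make_daily_dir(base=".", today=None):
    """Create (if needed) a folder named after the date and return its path.

    Hypothetical helper: the YYYY-MM-DD naming is an assumption.
    """
    if today is None:
        today = datetime.date.today()
    path = os.path.join(base, today.strftime("%Y-%m-%d"))
    # Create the folder only when it is not there yet.
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```

　　A save routine could then build the output path with os.path.join(make_daily_dir(), filename) before opening the file.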
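　　Libraries used
　　re, BeautifulSoup, sys, urllib2
　　Notes
　　1. The runtime environment is Linux with Python 2.7.x. To use it on Windows, just change the commands in the script.
　　2. The bug: when processing "如何正确吐槽" (roughly, "how to roast properly"), only the first entry is fetched (laziness got the better of me).
　　3. Fetching the content directly, as below, does not work, because Zhihu has anti-scraping measures in place: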

urllib2.urlopen(url).read()
　　So the fix is simply to send request headers:

def getHtml(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
    request = urllib2.Request(url, None, header)
    response = urllib2.urlopen(request)
    text = response.read()
    return text
　　4. Because the site zhihudaily.ahorn.me goes down from time to time, the script will sometimes fail with an error.
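　　Since zhihudaily.ahorn.me is sometimes unreachable (note 4), one option is to retry a failed request a few times before giving up. The sketch below is not part of the original script: fetch_with_retry, retries and delay are hypothetical names, and it takes the fetch function as an argument so that it does not depend on any particular HTTP library. It catches IOError because urllib2.URLError is a subclass of IOError in Python 2.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on IOError wait `delay` seconds and try again.

    Hypothetical helper. Gives up and re-raises the last error after
    `retries` attempts.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError as err:  # covers urllib2.URLError in Python 2
            last_error = err
            # Sleep only if another attempt will follow.
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error
```

　　With the getHtml function from this post, a call would look like text = fetch_with_retry(getHtml, url).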
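　　5. When analysing the content you can use re directly, or call the functions in BeautifulSoup (regular expressions intimidate me, so I went straight to bs), for example: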

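def saveText(text):
    soup = BeautifulSoup(text)
    # the first <h2> on the page is used as the file name
    filename = soup.h2.get_text() + ".txt"
    fp = file(filename, 'w')
    # find() returns only the first div with class "content"
    content = soup.find('div', "content")
    content = content.get_text()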
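　　For comparison, the re route could look roughly like this. It is a sketch only: the HTML in SAMPLE is made up for illustration (the real page markup may differ), extract_title_and_content is a hypothetical name, and patterns like these break easily on nested div elements, which is one reason to prefer BeautifulSoup.

```python
import re

# Made-up page fragment for illustration; the real markup may differ.
SAMPLE = (
    '<h2 class="question-title">First question</h2>'
    '<div class="content"><p>Hello <b>world</b>.</p></div>'
)


def strip_tags(fragment):
    """Remove anything that looks like a tag, roughly like get_text()."""
    return re.sub(r'<[^>]+>', '', fragment).strip()


def extract_title_and_content(html):
    """Return (title, text) for the first <h2> and the first div.content."""
    title = re.search(r'<h2[^>]*>(.*?)</h2>', html, re.S)
    body = re.search(r'<div[^>]*class="content"[^>]*>(.*?)</div>', html, re.S)
    if title is None or body is None:
        return None
    return strip_tags(title.group(1)), strip_tags(body.group(1))
```

　　Like soup.h2 and soup.find in the snippet above, re.search only returns the first match.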
　　show me the code

#Filename: getZhihu.py
import re
import sys
import urllib2

from bs4 import BeautifulSoup

# Python 2 only: force utf-8 as the default encoding so Chinese text can be written to files
reload(sys)
sys.setdefaultencoding("utf-8")

# get the html code, sending browser-like headers
def getHtml(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
    request = urllib2.Request(url, None, header)
    response = urllib2.urlopen(request)
    text = response.read()
    return text

# save the content in txt files
def saveText(text):
    soup = BeautifulSoup(text)
    # the first <h2> is used as the file name
    filename = soup.h2.get_text() + ".txt"
    fp = file(filename, 'w')
    # find() returns only the first div with class "content"
    content = soup.find('div', "content")
    content = content.get_text()
    # print content  # test
    fp.write(content)
    fp.close()

# get the urls from zhihudaily.ahorn.me
def getUrl(url):
    html = getHtml(url)
    # print html
    soup = BeautifulSoup(html)
    urls_page = soup.find('div', "post-body")
    # print urls_page
    # two capture groups, so findall returns (url, "http") tuples
    urls = re.findall('"((http)://.*?)"', str(urls_page))
    return urls

# main() function
def main():
    page = "http://zhihudaily.ahorn.me"
    urls = getUrl(page)
    for url in urls:
        text = getHtml(url[0])  # url[0] is the full URL
        saveText(text)

if __name__ == "__main__":
    main()
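　　One detail worth spelling out is why main() uses url[0]. The pattern in getUrl has two capture groups, and re.findall returns a tuple per match when a pattern has more than one group. The small sketch below shows the difference; SAMPLE is a made-up fragment standing in for str(urls_page), and the example.com links are placeholders.

```python
import re

# Made-up fragment standing in for str(urls_page).
SAMPLE = (
    '<a href="http://example.com/story/1">a</a> '
    '<a href="http://example.com/story/2">b</a>'
)

# Two groups, as in getUrl: each match is a (full_url, "http") tuple.
pairs = re.findall('"((http)://.*?)"', SAMPLE)

# One group: each match is just the URL string.
urls = re.findall('"(http://.*?)"', SAMPLE)

print(pairs[0])
print(urls[0])
```

　　With the single-group pattern, the loop could pass url to getHtml directly instead of url[0].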
　　
