[Experience Share] [Python Web Study] Implementing a crawler with Python 3.1.2 -- C

2011/06/28
      Implementation of the Retriever class, which uses MyParser to download, store, and parse the hyperlinks inside a page.
      Three parts need to be implemented.
      1. From the URL, work out a file name and path suitable for storing the page, and create them; this is done in the constructor. (A sketch follows the splitext() excerpt below.)
      urllib.parse.urlparse(url): splits a URL into its six components.
  urllib.parse.urlparse(urlstring, default_scheme='', allow_fragments=True) Parse a URL into six components, returning a 6-tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up in smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present. For example:
  >>> from urllib.parse import urlparse
  >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
  >>> o   # doctest: +NORMALIZE_WHITESPACE
  ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
              params='', query='', fragment='')
  >>> o.scheme
  'http'
  >>> o.port
  80
  >>> o.geturl()
  'http://www.cwi.nl:80/%7Eguido/Python.html'
      os.path.splitext(path): splits path into a (name, extension) pair.
  os.path.splitext(path) Split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period. Leading periods on the basename are ignored; splitext('.cshrc') returns ('.cshrc', '').
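  Combining the two calls above, here is a minimal sketch of what the constructor's path-building step could look like. The function name make_local_path, the downloads root directory, and the index.html/.html fallbacks are assumptions for illustration, not from the original post:

    import os
    from urllib.parse import urlparse

    def make_local_path(url, root='downloads'):
        # Map a URL onto a local file path suitable for storing the page.
        parts = urlparse(url)                  # the six URL components
        path = parts.path or '/'
        if path.endswith('/'):
            path += 'index.html'               # bare directory: pick a default name (assumed)
        name, ext = os.path.splitext(path)
        if not ext:
            path = name + '.html'              # no extension: assume an HTML page
        # keep pages from different hosts apart by prefixing the netloc
        local = os.path.join(root, parts.netloc, path.lstrip('/'))
        directory = os.path.dirname(local)
        if not os.path.isdir(directory):       # makedirs(..., exist_ok=True) needs 3.2+
            os.makedirs(directory)
        return local

    print(make_local_path('http://www.cwi.nl:80/%7Eguido/Python.html'))
    # downloads/www.cwi.nl:80/%7Eguido/Python.html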
      2. Download the page content. It is common to hit pages that open fine in a browser but cannot be downloaded programmatically, so the likely exceptions must be checked for. (A sketch follows the urlretrieve() excerpt below.)
      urllib.request.urlretrieve(self.url, self.file): downloads the page at self.url into the file self.file.
  urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None) Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().
  The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name). The third argument, if present, is a hook function that will be called once on establishment of the network connection and once after each block read thereafter. The hook will be passed three arguments; a count of blocks transferred so far, a block size in bytes, and the total size of the file. The third argument may be -1 on older FTP servers which do not return a file size in response to a retrieval request.
  If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function in urllib.parse.
  urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.
  The Content-Length is treated as a lower bound: if there’s more data to read, urlretrieve reads more data, but if less data is available, it raises the exception.
  You can still retrieve the downloaded data in this case, it is stored in the content attribute of the exception instance.
  If no Content-Length header was supplied, urlretrieve can not check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.
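  A minimal sketch of the download step with the exception checks the post calls for; the function name download and the handling choices are illustrative assumptions:

    import urllib.request
    import urllib.error

    def download(url, filename):
        # Fetch url into filename; some pages open in a browser but
        # still fail here, so the likely exceptions are checked for.
        try:
            fname, headers = urllib.request.urlretrieve(url, filename)
        except urllib.error.ContentTooShortError as e:
            # Fewer bytes arrived than the Content-Length header promised;
            # the partial payload is still available as e.content.
            print('interrupted download:', url)
            return None
        except (urllib.error.URLError, IOError) as e:
            # unreachable host, HTTP error status, local write failure, ...
            print('cannot download', url, ':', e)
            return None
        return fname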
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      header = {'User-Agent' : user_agent}
      request = urllib.request.Request(self.url, headers=header): creates a Request object carrying the spoofed User-Agent (see the sketch after the excerpt below).
  class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False) This class is an abstraction of a URL request.
  url should be a string containing a valid URL.
  data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
  headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself.
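  A minimal sketch tying the header spoofing together: since urlretrieve() takes no headers argument, this assumes urlopen() on the Request followed by writing the bytes out by hand. The function name fetch and the error handling are illustrative, not from the original post:

    import urllib.request
    import urllib.error

    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    header = {'User-Agent': user_agent}

    def fetch(url, filename):
        # Download url while presenting a browser-like User-Agent.
        request = urllib.request.Request(url, headers=header)
        try:
            response = urllib.request.urlopen(request)
            data = response.read()             # the raw bytes of the page
            response.close()
        except urllib.error.HTTPError as e:
            print('server refused', url, '- HTTP status', e.code)
            return False
        except urllib.error.URLError as e:
            print('cannot reach', url, ':', e.reason)
            return False
        with open(filename, 'wb') as f:        # bytes, not text
            f.write(data)
        return True

  Spoofing the User-Agent matters because some servers answer the default Python-urllib agent string with an error, which may be one way a page can "open in a browser but not download" as noted in part 2.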
