Scrapy Source Code Analysis Series - 1: spider, spidermanager, crawler, cmdline, command
The version analyzed is 0.24.6, url: https://github.com/DiamondStudio/scrapy/blob/0.24.6
As the Scrapy source tree on GitHub shows, the package contains the following subpackages:
commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib
and the following modules:
_monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py,
extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py,
middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py,
spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py
We start the analysis with the most important modules.
0. Third-party libraries and frameworks that Scrapy depends on
twisted
1. Modules: spider, spidermanager, crawler, cmdline, command
1.1 spider.py spidermanager.py crawler.py
spider.py defines the base class for spiders: BaseSpider. Each spider instance is bound to exactly one crawler, and that binding can only be set once.
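To make the set-once binding concrete, here is a minimal sketch of the idea (illustrative only, not the verbatim spider.py source; the class name SpiderSketch is made up):

class SpiderSketch(object):
    # Hedged sketch: a spider can be bound to a crawler exactly once.
    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already bound"
        self._crawler = crawler

    @property
    def crawler(self):
        assert hasattr(self, '_crawler'), "spider is not bound to any crawler yet"
        return self._crawler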
So what does a crawler actually provide? crawler.py defines the classes Crawler and CrawlerProcess.
The Crawler class depends on SignalManager, ExtensionManager and ExecutionEngine, as well as the settings STATS_CLASS, SPIDER_MANAGER_CLASS and LOG_FORMATTER.
The CrawlerProcess class runs multiple Crawlers sequentially within a single process and relies on twisted.internet.reactor and twisted.internet.defer to start the crawling. This class comes up again in section 1.2 when we look at cmdline.py.
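To give a feel for how these pieces fit together, here is a hedged sketch of how a CrawlerProcess is typically driven with the 0.24-era API (the spider name 'myspider' is assumed, and it presumes running inside a Scrapy project):

# Hedged sketch: driving CrawlerProcess by hand (0.24-era API); names are assumed.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

crawler = process.create_crawler()           # build a Crawler from the settings
spider = crawler.spiders.create('myspider')  # the crawler's SpiderManager looks the spider up by name
crawler.crawl(spider)                        # schedule the spider on this crawler
process.start()                              # start the Twisted reactor; blocks until the crawl finishes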
spidermanager.py defines the SpiderManager class, which is used to create and manage all website-specific spiders.
class SpiderManager(object):

    implements(ISpiderManager)

    def __init__(self, spider_modules):
        self.spider_modules = spider_modules
        self._spiders = {}
        for name in self.spider_modules:
            for module in walk_modules(name):
                self._load_spiders(module)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._spiders[spcls.name] = spcls
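Besides loading, SpiderManager in 0.24 also exposes list() and create(name) for looking spiders up by name. A hedged usage sketch ('myproject.spiders' and 'myspider' are made-up names):

# Hedged usage sketch of SpiderManager (0.24-era API); module and spider names are made up.
manager = SpiderManager(['myproject.spiders'])  # same module paths as the SPIDER_MODULES setting
print(manager.list())                 # names of all spiders discovered by walk_modules()
spider = manager.create('myspider')   # instantiate a spider class by its name attribute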
1.2 cmdline.py command.py
cmdline.py defines the public function execute(argv=None, settings=None).
The execute function is the entry point of the scrapy command-line tool, as the generated script shows:
XiaoKL$ cat `which scrapy`
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())
So execute() is a natural entry point for reading the Scrapy source. Here is the execute() function:
def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)
The execute() function mainly does the following: it loads the scrapy command modules and resolves which command was requested, parses the command-line options, obtains the settings, and creates a CrawlerProcess object. The CrawlerProcess object, the settings and the parsed options are all handed to the ScrapyCommand (or subclass) instance.
Naturally, the next module to look at is the one that defines ScrapyCommand: command.py.
The concrete subclasses of ScrapyCommand are defined in the scrapy.commands subpackage.
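As a side note, execute() can also be called programmatically, which is handy when experimenting with the command-line machinery (a hedged example; 'myspider' is an assumed spider name and this must run inside a Scrapy project):

# Hedged example: invoking cmdline.execute() directly instead of via the scrapy script.
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'myspider'])  # argv[0] is the program name; execute() ends with sys.exit()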
The _run_print_help() function ultimately leads to cmd.run() being called, which executes the command. It looks like this:
def _run_print_help(parser, func, *a, **kw):
    try:
        func(*a, **kw)
    except UsageError as e:
        if str(e):
            parser.error(str(e))
        if e.print_help:
            parser.print_help()
        sys.exit(2)
Here func is the _run_command function passed in as an argument, whose implementation essentially just calls cmd.run():
def _run_command(cmd, args, opts):
    if opts.profile or opts.lsprof:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)
This decoupling of cmdline from the concrete commands is a design worth borrowing in our own projects.
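A stripped-down, Scrapy-independent sketch of that pattern (all names here are made up for illustration):

# Minimal sketch of the cmdline/commands decoupling: the dispatcher only knows the
# Command interface, never the concrete command classes. All names are illustrative.
import optparse
import sys


class Command(object):
    """Base class; the dispatcher programs against this interface only."""

    def add_options(self, parser):
        pass

    def run(self, args, opts):
        raise NotImplementedError


class HelloCommand(Command):
    def run(self, args, opts):
        print("hello, %s" % (args[0] if args else "world"))


# In Scrapy this mapping is built dynamically by scanning the scrapy.commands package.
COMMANDS = {'hello': HelloCommand()}


def execute(argv):
    cmdname = argv[1]
    cmd = COMMANDS[cmdname]
    parser = optparse.OptionParser()
    cmd.add_options(parser)
    opts, args = parser.parse_args(argv[2:])
    cmd.run(args, opts)


if __name__ == '__main__':
    execute(sys.argv)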
command.py defines the ScrapyCommand class, which serves as the base class for all Scrapy commands. Let's take a quick look at the interface/methods that ScrapyCommand provides:
class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapy.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    @property
    def crawler(self):
        warnings.warn("Command's default `crawler` is deprecated and will be removed. "
            "Use `create_crawler` method to instatiate crawlers.",
            ScrapyDeprecationWarning)

        if not hasattr(self, '_crawler'):
            crawler = self.crawler_process.create_crawler()

            old_start = crawler.start
            self.crawler_process.started = False

            def wrapped_start():
                if self.crawler_process.started:
                    old_start()
                else:
                    self.crawler_process.started = True
                    self.crawler_process.start()
            crawler.start = wrapped_start

            self.set_crawler(crawler)

        return self._crawler

    # --- method bodies omitted here; see command.py for the full implementations ---

    def syntax(self):
        ...

    def short_desc(self):
        ...

    def long_desc(self):
        ...

    def help(self):
        ...

    def add_options(self, parser):
        ...

    def process_options(self, args, opts):
        ...

    def run(self, args, opts):
        ...
Class attributes of ScrapyCommand:
requires_project: whether the command must be run inside a Scrapy project.
crawler_process: the CrawlerProcess object; it is assigned in the execute() function of cmdline.py.
Methods of ScrapyCommand worth paying attention to:
crawler(self): a property that lazily creates the Crawler object.
run(self, args, opts): must be overridden by subclasses.
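Before moving on, here is a rough illustration of what such a subclass looks like, loosely modeled on the built-in crawl command in 0.24 (a hedged sketch, not its verbatim source; the option handling is simplified):

# Hedged sketch of a ScrapyCommand subclass, loosely modeled on scrapy.commands.crawl in 0.24.
from scrapy.command import ScrapyCommand
from scrapy.exceptions import UsageError


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "<spider>"

    def short_desc(self):
        return "Run a spider (sketch)"

    def run(self, args, opts):
        if len(args) != 1:
            raise UsageError()
        spname = args[0]
        # self.crawler_process and self.settings were assigned by cmdline.execute()
        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spname)
        crawler.crawl(spider)
        self.crawler_process.start()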
Next, we will look at a concrete subclass of ScrapyCommand in detail (see Python.Scrapy.14-scrapy-source-code-analysis-part-4).
To Be Continued:
Up next: the modules signals.py, signalmanager.py, project.py and conf.py (Python.Scrapy.12-scrapy-source-code-analysis-part-2).