Posted by hc6538 on 2015-8-3 09:36:24

Apache Nutch 1.3 Study Notes 7 (CrawlDb)

  
  This post looks at updatedb in CrawlDb, the command used to update the CrawlDb database.
1. bin/nutch updatedb
  When using the nutch command line you will see a command called updatedb. It simply calls the update method of the CrawlDb.java class. Its usage help is as follows:
  

Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
    crawldb         CrawlDb to update
    -dir segments   parent directory containing all segments to update from
    seg1 seg2 ...   list of segment names to update from
    -force          force update even if CrawlDb appears to be locked (CAUTION advised)
    -normalize      use URLNormalizer on urls in CrawlDb and segment (usually not needed)
    -filter         use URLFilters on urls in CrawlDb and segment
    -noAdditions    only update already existing URLs, don't add any newly discovered URLs
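For example, assuming a hypothetical crawl layout in which the CrawlDb lives at crawl/crawldb and all segments sit under crawl/segments (these paths are illustrative, not from the post), a typical invocation looks like:

```shell
# Hypothetical paths -- adjust to your own crawl directory layout.
# Update crawl/crawldb from every segment under crawl/segments,
# running URL filters on the way in:
bin/nutch updatedb crawl/crawldb -dir crawl/segments -filter
```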
  
2. Now let's analyze what the update method actually does
2.1 Job-submission parameters of update; part of the code is shown below
  

    // Create a new job. createJob also does some related configuration:
    // it adds the current directory (the existing CrawlDb directory) as input,
    // sets the input format to SequenceFileInputFormat,
    // configures the Map-Reduce pair as CrawlDbFilter-CrawlDbReducer,
    // sets the output format to MapFileOutputFormat,
    // and configures the output key/value types.
    JobConf job = CrawlDb.createJob(getConf(), crawlDb);
    // Set some job parameters
    job.setBoolean(CRAWLDB_ADDITIONS_ALLOWED, additionsAllowed);
    job.setBoolean(CrawlDbFilter.URL_FILTERING, filter);
    job.setBoolean(CrawlDbFilter.URL_NORMALIZING, normalize);
    // Add the input directories: one is crawl_fetch, the other crawl_parse.
    // (The '<' in the loop condition was eaten by the forum's HTML rendering;
    // the loop walks the segments and adds both subdirectories of each,
    // body abbreviated here -- Nutch's source also checks that they exist.)
    for (int i = 0; i < segments.length; i++) {
      FileInputFormat.addInputPath(job, new Path(segments[i], CrawlDatum.FETCH_DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segments[i], CrawlDatum.PARSE_DIR_NAME));
    }
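To make the update semantics concrete, here is a minimal, non-Hadoop sketch of the per-URL merge that the CrawlDbFilter/CrawlDbReducer pair ultimately performs: for each URL, the old CrawlDb entry is combined with the datums collected from crawl_fetch and crawl_parse, and the most authoritative status wins. The Status enum and its priority numbers are simplified stand-ins, not Nutch's actual CrawlDatum status constants:

```java
import java.util.List;

// Simplified model of the per-URL merge done by updatedb's reduce step.
// Status names and priorities are illustrative only.
public class CrawlDbMergeSketch {
    enum Status { UNFETCHED, FETCHED, GONE, LINKED }

    // Higher number = more authoritative information about the URL.
    static int priority(Status s) {
        switch (s) {
            case FETCHED:   return 3;
            case GONE:      return 2;
            case UNFETCHED: return 1;
            default:        return 0; // LINKED: newly discovered via an outlink
        }
    }

    /**
     * Merge the existing CrawlDb entry (null if the URL is new) with the
     * datums collected from crawl_fetch / crawl_parse for the same URL.
     * Returns null when the URL is new and additions are disallowed
     * (this models the -noAdditions flag / CRAWLDB_ADDITIONS_ALLOWED).
     */
    static Status merge(Status old, List<Status> updates, boolean additionsAllowed) {
        if (old == null && !additionsAllowed) return null;
        Status best = old;
        for (Status u : updates) {
            if (best == null || priority(u) > priority(best)) best = u;
        }
        return best;
    }

    public static void main(String[] args) {
        // A fetched result overrides the old UNFETCHED entry:
        System.out.println(merge(Status.UNFETCHED,
                List.of(Status.FETCHED, Status.LINKED), true));       // FETCHED
        // A newly discovered URL is dropped when additions are disallowed:
        System.out.println(merge(null, List.of(Status.LINKED), false)); // null
    }
}
```

Note how the -noAdditions flag from the usage help maps onto the additionsAllowed boolean that the code above sets on the job.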