Apache Nutch 1.3 Study Notes 7 (CrawlDb)
This note mainly looks at updatedb in CrawlDb, which is used to update the CrawlDb database.
1. bin/nutch updatedb
In the nutch command line you will see a command called updatedb; it simply calls the update method of the CrawlDb.java class. Its usage help is as follows:
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
    crawldb         CrawlDb to update
    -dir segments   parent directory containing all segments to update from
    seg1 seg2 ...   list of segment names to update from
    -force          force update even if CrawlDb appears to be locked (CAUTION advised)
    -normalize      use URLNormalizer on urls in CrawlDb and segment (usually not needed)
    -filter         use URLFilters on urls in CrawlDb and segment
    -noAdditions    only update already existing URLs, don't add any newly discovered URLs
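For example, assuming the CrawlDb lives in crawl/crawldb and the fetched segments sit under crawl/segments (example paths, not taken from the help text above), a typical invocation would be:

bin/nutch updatedb crawl/crawldb -dir crawl/segments -normalize -filter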
2. Now let's look at what the update method actually does
2.1 The job submission parameters in update; part of the code is as follows
// Create a new job; createJob also does some related configuration:
// it adds the current directory (the existing CrawlDb) as input and sets the input format to SequenceFileInputFormat,
// configures the Mapper/Reducer as CrawlDbFilter/CrawlDbReducer,
// sets the output format to MapFileOutputFormat,
// and also configures the output key/value types
JobConf job = CrawlDb.createJob(getConf(), crawlDb);
// set some job parameters
job.setBoolean(CRAWLDB_ADDITIONS_ALLOWED, additionsAllowed);
job.setBoolean(CrawlDbFilter.URL_FILTERING, filter);
job.setBoolean(CrawlDbFilter.URL_NORMALIZING, normalize);
// add the input directories of each segment: one is crawl_fetch, the other is crawl_parse
for (int i = 0; i < segments.length; i++) {
  Path fetch = new Path(segments[i], CrawlDatum.FETCH_DIR_NAME);   // crawl_fetch
  Path parse = new Path(segments[i], CrawlDatum.PARSE_DIR_NAME);   // crawl_parse
  if (fs.exists(fetch) && fs.exists(parse)) {
    FileInputFormat.addInputPath(job, fetch);
    FileInputFormat.addInputPath(job, parse);
  } else {
    LOG.info(" - skipping invalid segment " + segments[i]);
  }
}
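The comments above summarize what CrawlDb.createJob configures. For reference, here is a sketch of that static method in CrawlDb.java following those comments; the NutchJob wrapper, the random temporary output directory (newCrawlDb) and the Text/CrawlDatum output types are my reading of the Nutch 1.3 source rather than details stated in this note, so treat them as an approximation:

public static JobConf createJob(Configuration config, Path crawlDb) throws IOException {
  // output goes to a temporary directory next to the crawldb; it replaces "current" after the job succeeds
  Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf job = new NutchJob(config);
  job.setJobName("crawldb " + crawlDb);

  // the existing CrawlDb ("current" directory) is one of the inputs, if it already exists
  Path current = new Path(crawlDb, CURRENT_NAME);
  if (FileSystem.get(job).exists(current)) {
    FileInputFormat.addInputPath(job, current);
  }
  job.setInputFormat(SequenceFileInputFormat.class);

  // map with CrawlDbFilter (URL filtering/normalizing), reduce with CrawlDbReducer (merging CrawlDatum entries per URL)
  job.setMapperClass(CrawlDbFilter.class);
  job.setReducerClass(CrawlDbReducer.class);

  FileOutputFormat.setOutputPath(job, newCrawlDb);
  job.setOutputFormat(MapFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);

  return job;
}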