Apache Nutch 1.3 Study Notes 7 (CrawlDb)
This note mainly looks at updatedb in CrawlDb, which is used to update the CrawlDb database.
1. bin/nutch updatedb
In the nutch command line you will see a command called updatedb; it simply calls the update method of the CrawlDb.java class. Its usage help is as follows:
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
    crawldb         CrawlDb to update
    -dir segments   parent directory containing all segments to update from
    seg1 seg2 ...   list of segment names to update from
    -force          force update even if CrawlDb appears to be locked (CAUTION advised)
    -normalize      use URLNormalizer on urls in CrawlDb and segment (usually not needed)
    -filter         use URLFilters on urls in CrawlDb and segment
    -noAdditions    only update already existing URLs, don't add any newly discovered URLs
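For example, assuming the CrawlDb lives in crawl/crawldb and the fetched segments sit under crawl/segments (example paths, not taken from the help text above), a typical invocation would be:

bin/nutch updatedb crawl/crawldb -dir crawl/segments -normalize -filter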
2. Now let's look at what the update method actually does
2.1 The job submission parameters in update; part of the code is as follows
// Create a new job; createJob also does some related configuration:
// it adds the current directory (the existing CrawlDb) as input and sets the input format to SequenceFileInputFormat,
// configures the Mapper/Reducer as CrawlDbFilter/CrawlDbReducer,
// sets the output format to MapFileOutputFormat,
// and also configures the output key/value types
JobConf job = CrawlDb.createJob(getConf(), crawlDb);
// set some job parameters
job.setBoolean(CRAWLDB_ADDITIONS_ALLOWED, additionsAllowed);
job.setBoolean(CrawlDbFilter.URL_FILTERING, filter);
job.setBoolean(CrawlDbFilter.URL_NORMALIZING, normalize);
// add the input directories of each segment: one is crawl_fetch, the other is crawl_parse
for (int i = 0; i < segments.length; i++) {
  Path fetch = new Path(segments[i], CrawlDatum.FETCH_DIR_NAME);   // crawl_fetch
  Path parse = new Path(segments[i], CrawlDatum.PARSE_DIR_NAME);   // crawl_parse
  if (fs.exists(fetch) && fs.exists(parse)) {
    FileInputFormat.addInputPath(job, fetch);
    FileInputFormat.addInputPath(job, parse);
  } else {
    LOG.info(" - skipping invalid segment " + segments[i]);
  }
}
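The comments above summarize what CrawlDb.createJob configures. For reference, here is a sketch of that static method in CrawlDb.java following those comments; the NutchJob wrapper, the random temporary output directory (newCrawlDb) and the Text/CrawlDatum output types are my reading of the Nutch 1.3 source rather than details stated in this note, so treat them as an approximation:

public static JobConf createJob(Configuration config, Path crawlDb) throws IOException {
  // output goes to a temporary directory next to the crawldb; it replaces "current" after the job succeeds
  Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf job = new NutchJob(config);
  job.setJobName("crawldb " + crawlDb);

  // the existing CrawlDb ("current" directory) is one of the inputs, if it already exists
  Path current = new Path(crawlDb, CURRENT_NAME);
  if (FileSystem.get(job).exists(current)) {
    FileInputFormat.addInputPath(job, current);
  }
  job.setInputFormat(SequenceFileInputFormat.class);

  // map with CrawlDbFilter (URL filtering/normalizing), reduce with CrawlDbReducer (merging CrawlDatum entries per URL)
  job.setMapperClass(CrawlDbFilter.class);
  job.setReducerClass(CrawlDbReducer.class);

  FileOutputFormat.setOutputPath(job, newCrawlDb);
  job.setOutputFormat(MapFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);

  return job;
}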