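Apache Nutch 1.3 Study Notes 8 (LinkDb)
These notes look at org.apache.nutch.crawl.LinkDb, which is mainly used to compute inverted links, i.e. for each URL, the pages that link to it.
1. Running the command bin/nutch invertlinks
Usage and options: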
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
    linkdb            output LinkDb to create or update
    -dir segmentsDir  parent directory of several segments, OR
    seg1 seg2 ...     list of segment directories
    -force            force update even if LinkDb appears to be locked (CAUTION advised)
    -noNormalize      don't normalize link URLs
    -noFilter         don't apply URLFilters to link URLs
Output of a local run:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb/ db/segments/20110822105243/
LinkDb: starting at 2011-08-29 09:21:36
LinkDb: linkdb: db/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: db/segments/20110822105243 // add the new segment
LinkDb: merging with existing linkdb: db/linkdb // merge with the existing linkdb
LinkDb: finished at 2011-08-29 09:21:40, elapsed: 00:00:03
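2. Analysis of the main LinkDb source code
LinkDb mainly calls a single invert method, which does two things:
+ analyzes the newly supplied segment directories and produces a new inverted link database
+ merges the newly produced inverted link database with the existing one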
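The two steps above can be illustrated with a small, self-contained sketch in plain Java. This is not Nutch code: the class LinkInversionSketch, its methods and the example URLs are hypothetical, in-memory maps stand in for the segment data and the linkdb, and details that the real job handles (anchor text, URL normalization and filtering, running as a MapReduce job) are left out. It only shows the idea of turning "page -> outlinks" into "target -> inlinks" and then merging the result into an existing database.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Nutch.
public class LinkInversionSketch {

    // Step 1: invert "page -> outlinks" into "target URL -> set of source URLs".
    public static Map<String, Set<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, Set<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : outlinks.entrySet()) {
            String fromUrl = page.getKey();
            for (String toUrl : page.getValue()) {
                // Each outlink fromUrl -> toUrl becomes an inlink of toUrl.
                inlinks.computeIfAbsent(toUrl, k -> new TreeSet<>()).add(fromUrl);
            }
        }
        return inlinks;
    }

    // Step 2: merge freshly inverted links into an existing inlink database.
    public static Map<String, Set<String>> merge(Map<String, Set<String>> existing,
                                                 Map<String, Set<String>> fresh) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> db : List.of(existing, fresh)) {
            for (Map.Entry<String, Set<String>> entry : db.entrySet()) {
                // Union the inlink sets that share the same target URL.
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up outlinks, standing in for one segment's parsed pages.
        Map<String, List<String>> segment = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://c.example/"));
        // Made-up existing database with one known inlink.
        Map<String, Set<String>> existing =
            Map.of("http://c.example/", Set.of("http://d.example/"));

        System.out.println(merge(existing, invert(segment)));
    }
}
```

Sorted maps and sets are used only to make the printed result deterministic; the sketch makes no claim about how Nutch orders or stores its entries.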
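2.1 Analyzing the newly supplied segment directories. The main code is as follows:
// Create a new MapReduce job
JobConf job = LinkDb.createJob(getConf(), linkDb, normalize, filter);
// Add the directories to the job's input paths; there may be several input paths (each segment's parse_data)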
for (int i = 0; i