
Apache Nutch 1.3 Study Notes 4 (SegmentReader Analysis)

  
  In the previous note we looked at the Generate phase, which produces the fetchlist used by Fetch. This note introduces SegmentReader, the tool class for inspecting segments.

1. Command overview
  

bin/nutch readseg
    Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]
  
// The general options:
  

* General options:
      -nocontent      ignore content directory
      -nofetch        ignore crawl_fetch directory
      -nogenerate     ignore crawl_generate directory
      -noparse        ignore crawl_parse directory
      -noparsedata    ignore parse_data directory
      -noparsetext    ignore parse_text directory
  
// -dump writes the content of a segment out as text; the general options above
// can be appended to skip the corresponding sub-directories (an example follows the usage text below)
  

* SegmentReader -dump <segment_dir> <output> [general options]
      Dumps content of a <segment_dir> as a text file to <output>.

      <segment_dir>   name of the segment directory.
      <output>        name of the (non-existent) output directory.
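For example, to dump only the parse_text part of a segment, the other sub-directories can be excluded with the general options. This is a hypothetical invocation, not taken from the original run; the segment path reuses the one from section 2.1 below:

    bin/nutch readseg -dump db/segments/20110822105243 output_text -nocontent -nofetch -nogenerate -noparse -noparsedata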
  
// -list prints a synopsis of the given segments (an example follows the usage text below)
  

* SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
      List a synopsis of segments in specified directories, or all segments in
      a directory <segments>, and print it on System.out

      <segment_dir1> ...  list of segment directories to process
      -dir <segments>     directory that contains multiple segments
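The original post does not show a -list run; hypothetical invocations, again reusing the segment path from section 2.1, would look like this:

    bin/nutch readseg -list db/segments/20110822105243
    bin/nutch readseg -list -dir db/segments

The first form summarizes the listed segments, the second summarizes every segment found under db/segments.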
  
// -get prints the record for a given url (an example follows the usage text below)
  

* SegmentReader -get <segment_dir> <keyValue> [general options]
      Get a specified record from a segment, and print it on System.out.

      <segment_dir>   name of the segment directory.
      <keyValue>      value of the key (url).
          Note: put double-quotes around strings with spaces.
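-get is not demonstrated later in the post either; a hypothetical invocation, reusing the segment path and the URL that appear in the dump output of section 2.1:

    bin/nutch readseg -get db/segments/20110822105243 http://baike.baidu.com/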
  
2. Output of each command
2.1 bin/nutch readseg -dump
  Running the command locally produces the following output:
// dump a segment
  

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch readseg -dump db/segments/20110822105243/ output
    SegmentReader: dump segment: db/segments/20110822105243
    SegmentReader: done
  
      // list the output directory
  

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls output
    dump
  
// show part of the dumped content
  

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ head output/dump
    Recno:: 0
    URL:: http://baike.baidu.com/
    CrawlDatum::
    Version: 7
    Status: 67 (linked)
    Fetch time: Mon Aug 22 10:58:21 EDT 2011
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
  
  Now let's look at how this is implemented. The shell command ultimately calls the dump method of org.apache.nutch.segment.SegmentReader; the main part of that method is shown below:
  

// Create a MapReduce job for reading the segment
JobConf job = createJobConf();
job.setJobName("read " + segment);

// Check the general options: only the sub-directories that were not filtered out are added as input
if (ge) FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
if (fe) FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
if (pa) FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
if (co) FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
if (pd) FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
if (pt) FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));

// Input format of the segment data: SequenceFileInputFormat
job.setInputFormat(SequenceFileInputFormat.class);
// The Map and Reduce phases
job.setMapperClass(InputCompatMapper.class); // normalizes the key (legacy UTF8 keys become Text)
job.setReducerClass(SegmentReader.class);    // renders each value object as readable Text

Path tempDir = new Path(job.get("hadoop.tmp.dir", "/tmp") + "/segread-" + new java.util.Random().nextInt());
fs.delete(tempDir, true);

FileOutputFormat.setOutputPath(job, tempDir); // output directory
job.setOutputFormat(TextOutputFormat.class);  // output text
job.setOutputKeyClass(Text.class);
// Output value type. Note that the reducer above is SegmentReader, which emits Text values,
// while the value class declared here is NutchWritable, so a cast is involved.
// It is not clear to me why it is done this way.
job.setOutputValueClass(NutchWritable.class);

JobClient.runJob(job);

// concatenate the output
Path dumpFile = new Path(output, job.get("segment.dump.dir", "dump"));

// remove the old file
fs.delete(dumpFile, true);
FileStatus[] fstats = fs.listStatus(tempDir, HadoopFSUtil.getPassAllFilter());
Path[] files = HadoopFSUtil.getPaths(fstats);

PrintWriter writer = null;
int currentRecordNumber = 0;
// Merge the temporary part files produced above into the final file output/dump,
// adding some formatting information via the append method
if (files.length > 0) {
  // create print writer with format
  writer = new PrintWriter(new BufferedWriter(new OutputStreamWriter(fs.create(dumpFile))));
  try {
    for (int i = 0; i < files.length; i++) {
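To round things off, here is a minimal sketch (not from the original post) of driving the same dump from Java rather than from the shell. It assumes the Nutch 1.3 API discussed above: the SegmentReader constructor is assumed to take the Configuration plus six boolean flags mirroring the general options (content, fetch, generate, parse, parse_data, parse_text), and dump(segment, output) runs the MapReduce job we just walked through.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;

public class DumpSegmentExample {
    public static void main(String[] args) throws Exception {
        // Standard Nutch configuration (reads nutch-default.xml / nutch-site.xml)
        Configuration conf = NutchConfiguration.create();
        // Assumed flag order: content, fetch, generate, parse, parse_data, parse_text;
        // true means the corresponding directory is kept in the dump
        SegmentReader reader = new SegmentReader(conf, true, true, true, true, true, true);
        // Roughly equivalent to: bin/nutch readseg -dump db/segments/20110822105243 output
        reader.dump(new Path("db/segments/20110822105243"), new Path("output"));
    }
}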