[Experience Share] Apache Nutch (Part 1)

Posted on 2015-07-31 10:05:41
  Nutch currently has two versions:

  • 1.6 - uses the Hadoop Distributed File System (HDFS) for storage; stable and reliable.
  • 2.1 - abstracts the storage layer through Gora, so data can be kept in any of HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, or AvroStore, though some of these backends are not yet mature.
  Setting up the Nutch framework on Linux (CentOS):

  • Install svn

    yum install subversion

  • Install ant

    yum install ant

  • Check out Nutch (go to http://nutch.apache.org; the svn URL is listed in the Version Control section.)

    svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/

  • Build Nutch with ant

    cd release-1.6/
    ant
  After the ant build finishes, two new directories appear under release-1.6: build and runtime. Inside runtime there are two subdirectories, deploy and local, corresponding to Nutch's two run modes:

  • deploy - runs on Hadoop
  • local - runs on the local file system, with only a single map and a single reduce task.
  local/bin/nutch: studying this script is the key starting point. It shows how the script bridges Hadoop and Nutch by submitting apache-nutch-1.6.job to Hadoop's JobTracker, and which Java class each command maps to.


The Nutch script:


#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# The Nutch command script
#
# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#                   Default is 1000.
#
#   NUTCH_OPTS      Extra Java runtime options.
#
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac
# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
ls=`ls -ld "$THIS"`
link=`expr "$ls" : '.*-> \(.*\)$'`
if expr "$link" : '.*/.*' > /dev/null; then
THIS="$link"
else
THIS=`dirname "$THIS"`/"$link"
fi
done
# if no args specified, show usage
if [ $# = 0 ]; then
echo "Usage: nutch COMMAND"
echo "where COMMAND is one of:"
echo "  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)"
echo "  readdb            read / dump crawl db"
echo "  mergedb           merge crawldb-s, with optional filtering"
echo "  readlinkdb        read / dump link db"
echo "  inject            inject new urls into the database"
echo "  generate          generate new segments to fetch from crawl db"
echo "  freegen           generate new segments to fetch from text files"
echo "  fetch             fetch a segment's pages"
echo "  parse             parse a segment's pages"
echo "  readseg           read / dump segment data"
echo "  mergesegs         merge several segments, with optional filtering and slicing"
echo "  updatedb          update crawl db from segments after fetching"
echo "  invertlinks       create a linkdb from parsed segments"
echo "  mergelinkdb       merge linkdb-s, with optional filtering"
echo "  solrindex         run the solr indexer on parsed segments and linkdb"
echo "  solrdedup         remove duplicates from solr"
echo "  solrclean         remove HTTP 301 and 404 documents from solr"
echo "  parsechecker      check the parser for a given url"
echo "  indexchecker      check the indexing filters for a given url"
echo "  domainstats       calculate domain statistics from crawldb"
echo "  webgraph          generate a web graph from existing segments"
echo "  linkrank          run a link analysis program on the generated web graph"
echo "  scoreupdater      updates the crawldb with linkrank scores"
echo "  nodedumper        dumps the web graph's node scores"
echo "  plugin            load a plugin and run one of its classes main()"
echo "  junit             runs the given JUnit test"
echo " or"
echo "  CLASSNAME         run the class named CLASSNAME"
echo "Most commands print help when invoked w/o parameters."
exit 1
fi
# get arguments
COMMAND=$1
shift
# some directories
THIS_DIR=`dirname "$THIS"`
NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`
# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
#echo "run java in $NUTCH_JAVA_HOME"
JAVA_HOME=$NUTCH_JAVA_HOME
fi
if [ "$JAVA_HOME" = "" ]; then
echo "Error: JAVA_HOME is not set."
exit 1
fi
local=true
# NUTCH_JOB
if [ -f ${NUTCH_HOME}/*nutch*.job ]; then
local=false
for f in $NUTCH_HOME/*nutch*.job; do
NUTCH_JOB=$f;
done
fi
# cygwin path translation
if $cygwin; then
NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`
fi
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
#echo "run with heapsize $NUTCH_HEAPSIZE"
JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
#echo $JAVA_HEAP_MAX
fi
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# so that filenames w/ spaces are handled correctly in loops below
IFS=
# add libs to CLASSPATH
if $local; then
for f in $NUTCH_HOME/lib/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
# local runtime
# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
fi
fi
# cygwin path translation
if $cygwin; then
CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi
# setup 'java.library.path' for native-hadoop code if necessary
# used only in local mode
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/lib/native" ]; then
JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
if [ -d "${NUTCH_HOME}/lib/native" ]; then
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
else
JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
fi
fi
fi
if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
JAVA_LIBRARY_PATH=`cygpath -p -w "$JAVA_LIBRARY_PATH"`
fi
# restore ordinary behaviour
unset IFS
# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
NUTCH_LOGFILE='hadoop.log'
fi
#Fix log path under cygwin
if $cygwin; then
NUTCH_LOG_DIR=`cygpath -p -w "$NUTCH_LOG_DIR"`
fi
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi
# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "mergedb" ] ; then
CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrIndexer
elif [ "$COMMAND" = "solrdedup" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "solrclean" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrClean
elif [ "$COMMAND" = "parsechecker" ] ; then
CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "domainstats" ] ; then
CLASS=org.apache.nutch.util.domain.DomainStatistics
elif [ "$COMMAND" = "webgraph" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.WebGraph
elif [ "$COMMAND" = "linkrank" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.LinkRank
elif [ "$COMMAND" = "scoreupdater" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
elif [ "$COMMAND" = "nodedumper" ] ; then
CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
elif [ "$COMMAND" = "plugin" ] ; then
CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
CLASSPATH=$CLASSPATH:$NUTCH_HOME/test/classes/
CLASS=junit.textui.TestRunner
else
CLASS=$COMMAND
fi
# distributed mode
EXEC_CALL="hadoop jar $NUTCH_JOB"
if $local; then
EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
# check that hadoop can be found on the path
if [ $(which hadoop | wc -l ) -eq 0 ]; then
echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
exit -1;
fi
fi
# run it
exec $EXEC_CALL $CLASS "$@"
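Stripped of environment handling, the dispatch logic above boils down to mapping COMMAND to a Java class and exec-ing it. A minimal sketch covering only a few of the commands (class names are taken from the script above):

```shell
#!/bin/bash
# Minimal sketch of the bin/nutch dispatch: map a COMMAND name to its
# Java class, falling through to treating the argument as a class name.
nutch_class() {
  case "$1" in
    inject)   echo org.apache.nutch.crawl.Injector ;;
    generate) echo org.apache.nutch.crawl.Generator ;;
    fetch)    echo org.apache.nutch.fetcher.Fetcher ;;
    parse)    echo org.apache.nutch.parse.ParseSegment ;;
    updatedb) echo org.apache.nutch.crawl.CrawlDb ;;
    *)        echo "$1" ;;   # anything else runs as a literal class name
  esac
}

nutch_class inject   # prints org.apache.nutch.crawl.Injector
```

The real script then builds `EXEC_CALL` as either a plain `java` invocation (local mode) or `hadoop jar $NUTCH_JOB` (distributed mode) and execs it with that class.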
  The full list of nutch commands:



[iyunv@localhost local]# bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
parse             parse a segment's pages
readseg           read / dump segment data
mergesegs         merge several segments, with optional filtering and slicing
updatedb          update crawl db from segments after fetching
invertlinks       create a linkdb from parsed segments
mergelinkdb       merge linkdb-s, with optional filtering
solrindex         run the solr indexer on parsed segments and linkdb
solrdedup         remove duplicates from solr
solrclean         remove HTTP 301 and 404 documents from solr
parsechecker      check the parser for a given url
indexchecker      check the indexing filters for a given url
domainstats       calculate domain statistics from crawldb
webgraph          generate a web graph from existing segments
linkrank          run a link analysis program on the generated web graph
scoreupdater      updates the crawldb with linkrank scores
nodedumper        dumps the web graph's node scores
plugin            load a plugin and run one of its classes main()
junit             runs the given JUnit test
or
CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


[iyunv@localhost local]# bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
  What the parameters mean:

  • urlDir - directory containing the seed URLs
  • -solr - the Solr server address (omit it if you have no Solr instance)
  • -dir - directory in which to store the crawl output
  • -threads - number of fetch threads (default 10)
  • -topN - breadth of the crawl, the maximum number of pages per level (default Long.MAX_VALUE)
  • -depth - crawl depth (default 5)
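Put together, a fully spelled-out invocation looks like the following. The Solr URL and directory names are placeholders, not values from the post; the command is echoed here so it can be inspected before running:

```shell
# Compose a crawl command with every option set explicitly.
# The Solr URL and directory names below are placeholder assumptions.
CRAWL_CMD="bin/nutch crawl urls -solr http://localhost:8983/solr -dir data -threads 10 -depth 5 -topN 1000"
echo "$CRAWL_CMD"   # inspect it; run it with: eval "$CRAWL_CMD"
```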
  Configure local/conf/nutch-site.xml
  Getting better at Nutch means studying what every property in nutch-default.xml actually does, reading each one against the source code. Open local/conf/nutch-default.xml and find:




<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.
  </description>
</property>


  Copy the property above into the <configuration> element of nutch-site.xml and fill in its <value>. The value of http.agent.name plays the role of a browser User-Agent: a request header string that lets a server identify the client's operating system and version, CPU type, browser and version, rendering engine, language, plugins, and so on, for example: Opera/9.80 (Windows NT 5.1; Edition IBIS) Presto/2.12.388 Version/12.15. Nutch obeys the robots protocol, which is why this value must be set.
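As a sketch, writing this configuration can be scripted; the agent name "mycrawler" is a placeholder you should replace with your own, and the script assumes it is run from the runtime/local directory:

```shell
# Write a minimal nutch-site.xml setting http.agent.name.
# "mycrawler" is a placeholder agent name, not a value from the post.
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
</configuration>
EOF
grep '<value>' conf/nutch-site.xml   # confirm the value was written
```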
  Add seed URLs
  Under the local directory create a folder, e.g. urls, and inside it a file, e.g. url, containing the entry-point URL(s) of the site you want to crawl, e.g.: http://www.163.com/
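The two steps above, as commands (run from the local directory):

```shell
# Create the seed directory and a seed file holding one entry-point URL
# (the 163.com example from the post).
mkdir -p urls
echo "http://www.163.com/" > urls/url
cat urls/url   # prints http://www.163.com/
```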
  Configure local/conf/regex-urlfilter.txt
  Open local/conf/regex-urlfilter.txt, comment out the catch-all last line ("+."), and add the domain of the site you want to crawl:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# +.
+^http://([a-z0-9]*\.)*163\.com/
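The accept pattern can be sanity-checked outside Nutch with grep -E, whose extended-regex syntax handles this simple pattern the same way (Nutch itself applies Java regular expressions, so this is only an approximation):

```shell
# URLs matching the +^http://... rule are accepted by the filter;
# everything else is dropped.
RE='^http://([a-z0-9]*\.)*163\.com/'
echo "http://news.163.com/world/" | grep -E "$RE"                  # accepted
echo "http://www.example.com/"    | grep -E "$RE" || echo "filtered out"
```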
  Now you can crawl all of 163's pages. Create a data folder under local to hold the crawled content, choose suitable parameters, and run:

nohup bin/nutch crawl urls -dir data &

  nohup appends the command's output to the nohup.out file; while nutch runs, the crawler's log is written to local/logs/hadoop.log.
  When the crawl finishes, the data folder contains three directories, crawldb, linkdb, and segments:

  • crawldb - all the links the crawler needs to fetch
  • linkdb - every link together with its inlink URLs and anchor text
  • segments - the fetched pages, one folder per crawl round named by its timestamp; there are at most as many segments as the crawl depth. Nutch crawls breadth-first: each level of URLs produces one segment folder, until no new URLs remain.
  Each segment contains six folders:

  • crawl_generate - names a set of URLs to be fetched
  • crawl_fetch - contains the status of fetching each URL
  • content - contains the raw content retrieved from each URL
  • parse_text - contains the parsed text of each URL
  • parse_data - contains outlinks and metadata parsed from each URL
  • crawl_parse - contains the outlink URLs, used to update the crawldb
  None of these folders is directly human-readable; they are stored this way for efficient access and for indexing at a higher layer. To see the actual contents, use Nutch's read commands:

  1. Inspecting the CrawlDB (readdb)



[iyunv@localhost local]# bin/nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
    <crawldb>    directory name where crawldb is located
    -stats [-sort]    print overall statistics to System.out
        [-sort]    list status sorted by host
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir>
        [-format csv]    dump in Csv format
        [-format normal]    dump in standard format (default option)
        [-format crawldb]    dump as CrawlDB
        [-regex <expr>]    filter records with expression
        [-status <status>]    filter records by CrawlDatum status
    -url <url>    print information on <url> to System.out
    -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir>
        [<min>]    skip records with scores below this value.
            This can significantly improve performance.
  View the total number of URLs together with their status and scores:



[iyunv@localhost local]# bin/nutch readdb data/crawldb/ -stats
CrawlDb statistics start: data/crawldb/
Statistics for CrawlDb: data/crawldb/
TOTAL urls:    10635
retry 0:    10615
retry 1:    20
min score:    0.0
avg score:    2.6920545E-4
max score:    1.123
status 1 (db_unfetched):    9614
status 2 (db_fetched):    934
status 3 (db_gone):    2
status 4 (db_redir_temp):    81
status 5 (db_redir_perm):    4
CrawlDb statistics: done
  Export the details of every URL: bin/nutch readdb data/crawldb/ -dump crawldb (the output directory)
  2. Inspecting the linkdb
  View the link data for a URL: bin/nutch readlinkdb data/linkdb/ -url http://www.163.com/
Export the linkdb database: bin/nutch readlinkdb data/linkdb/ -dump linkdb (the output directory)
  3. Inspecting segments
  bin/nutch readseg -list -dir data/segments/ - lists each segment's name, the number of pages generated, fetch start and end times, and the fetched and parsed counts.



[iyunv@localhost local]# bin/nutch readseg -list -dir data/segments/
NAME              GENERATED    FETCHER START          FETCHER END            FETCHED    PARSED
20130427150144    53           2013-04-27T15:01:52    2013-04-27T15:05:15    53         51
20130427150553    1036         2013-04-27T15:06:01    2013-04-27T15:58:09    1094       921
20130427150102    1            2013-04-27T15:01:10    2013-04-27T15:01:10    1          1
  Export a segment: bin/nutch readseg -dump data/segments/20130427150144 segdb
where data/segments/20130427150144 is one segment folder and segdb is the folder that receives the converted content.
  This last command is probably the most useful for getting at page content, and it is usually run with a few options:
bin/nutch readseg -dump data/segments/20130427150144/ data_oscar/segments -nofetch -nogenerate -noparse -noparsedata -nocontent
The resulting dump then contains only the parsed body text of the pages, with no markup.
  
  
  Thanks to: http://yangshangchuan.iteye.com
  
