[Experience Sharing] Nutch distributed indexing (crawling)

Posted on 2016-12-13 11:20:22
  Compared with an intranet crawl, a whole-web crawl differs in that it starts from a much larger set of seed URLs, and (when the individual commands are used rather than the crawl command) crawl-urlfilter.txt does not restrict which URLs may be fetched; the crawl proceeds step by step, which keeps a large job manageable.

  Nutch 1.3 adds two further differences. First, fetcher.parse defaults to false, so each fetch must be followed by a separate parse step; at first I could not understand why the tutorial did it this way. Second, this version no longer has crawl-urlfilter.txt; it is replaced by regex-urlfilter.txt.

  For the differences when recrawling, see "nutch 数据增量更新" (Nutch incremental data update).

  This process is, in my view, where Nutch exercises Hadoop most deeply. Think about it: originally Hadoop was embedded inside Nutch as one of its modules; current versions of Nutch ship Hadoop separately, yet for distributed crawling you must put it (config files, jars, etc.) back under Nutch. At first I kept wondering how Nutch combines with Hadoop for distributed crawling. Distributed search is somewhat different: although it is also distributed, the HDFS it relies on is transparent to Nutch.
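  The two 1.3 changes above can be sketched as follows; this is a hedged sketch, and $NUTCH_HOME plus the example domain are assumptions for illustration:

```shell
# Sketch of the two Nutch 1.3 points above; $NUTCH_HOME is assumed
# to point at the Nutch installation directory.

# 1. fetcher.parse defaults to false, so a parse step must follow each
#    fetch. The shipped default can be inspected with:
grep -A 2 'fetcher.parse' "$NUTCH_HOME/conf/nutch-default.xml"

# 2. URL filtering now lives in conf/regex-urlfilter.txt
#    (crawl-urlfilter.txt no longer exists). A typical rule restricting
#    the crawl to a single (hypothetical) domain:
echo '+^https?://([a-z0-9]*\.)*example\.com/' >> "$NUTCH_HOME/conf/regex-urlfilter.txt"
```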
  Install process:
  a. configure Hadoop to run in cluster mode;
  b. copy all the Hadoop config files (masters and slaves) into the conf dir of each Nutch installation respectively;
  c. execute the crawl steps (SHOULD use the individual commands INSTEAD OF 'crawl', as 'crawl' is usually meant for intranet crawling).
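  Step c can be sketched with the individual commands. This is a hedged reconstruction from the job list below; exact arguments vary slightly between Nutch 1.x releases:

```shell
# All paths are on HDFS; the seed list is assumed to already be in crawl-url.
bin/nutch inject crawl/dist/crawldb crawl-url                     # jobs 1-2: inject + merge into crawldb
bin/nutch generate crawl/dist/crawldb crawl/dist/segments         # jobs 3-4: select + partition
SEG=$(hadoop fs -ls crawl/dist/segments | tail -1 | awk '{print $NF}')  # newest segment
bin/nutch fetch "$SEG"                                            # job 5: fetch only (fetcher.parse=false)
bin/nutch parse "$SEG"                                            # the separate parse step
bin/nutch updatedb crawl/dist/crawldb "$SEG"                      # job 6: update crawldb
bin/nutch invertlinks crawl/dist/linkdb -dir crawl/dist/segments  # job 7: build linkdb
bin/nutch index crawl/dist/indexes crawl/dist/crawldb crawl/dist/linkdb "$SEG"  # job 8: index-lucene
bin/nutch dedup crawl/dist/indexes                                # jobs 9-11: dedup
bin/nutch merge crawl/dist/index crawl/dist/indexes               # merge the shards into the final index
```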
  Here are the jobs belonging to this step:

Available Jobs (job tracker: master, started Mon Nov 07 20:50:54 CST 2011, user: hadoop)
job_201111072050_0001  inject crawl-url
job_201111072050_0002  crawldb crawl/dist/crawldb
job_201111072050_0003  generate: select from crawl/dist/crawldb
job_201111072050_0004  generate: partition crawl/dist/segments/2011110720
job_201111072050_0005  fetch crawl/dist/segments/20111107205746
job_201111072050_0006  crawldb crawl/dist/crawldb (update db actually)
job_201111072050_0007  linkdb crawl/dist/linkdb
job_201111072050_0008  index-lucene crawl/dist/indexes
job_201111072050_0009  dedup 1: urls by time
job_201111072050_0010  dedup 2: content by hash
job_201111072050_0011  dedup 3: delete from index(es)
  * jobs shown in the same color (on the original job tracker page) together form ONE step of the crawl command;
  * job 2: takes the sort job's output as input (merging it with the existing current data) to generate a new crawldb; duplicate URLs are therefore possible and are de-duplicated in the reduce phase (?);
  * job 4: since there are multiple crawlers, the URLs must be partitioned (by host, by default) so that each host is fetched by a single machine and nothing is fetched twice;
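  The by-host split in job 4 is governed by a Nutch configuration property; a minimal sketch, assuming the Nutch 1.x property name partition.url.mode:

```shell
# partition.url.mode controls how generate/partition assigns URLs to
# fetcher tasks: byHost (the default), byDomain, or byIP. byHost keeps
# all URLs of one host on a single crawler, which also helps enforce
# per-host politeness limits.
grep -B 1 -A 4 'partition.url.mode' "$NUTCH_HOME/conf/nutch-default.xml"
```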
  Here is the resulting output:
  hadoop@leibnitz-laptop:/xxxxxxxxx$ hadoop fs -lsr crawl/dist/
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000
-rw-r--r--   2 hadoop supergroup       6240 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000/data
-rw-r--r--   2 hadoop supergroup        215 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001
-rw-r--r--   2 hadoop supergroup       7779 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001/data
-rw-r--r--   2 hadoop supergroup        218 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:07 /user/hadoop/crawl/dist/index
-rw-r--r--   2 hadoop supergroup        369 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fdt
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fnm
-rw-r--r--   2 hadoop supergroup       1836 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.frq
-rw-r--r--   2 hadoop supergroup         14 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.nrm
-rw-r--r--   2 hadoop supergroup       4922 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.prx
-rw-r--r--   2 hadoop supergroup        171 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.tii
-rw-r--r--   2 hadoop supergroup      11234 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.tis
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:07 /user/hadoop/crawl/dist/index/segments.gen
-rw-r--r--   2 hadoop supergroup        284 2011-11-07 21:07 /user/hadoop/crawl/dist/index/segments_2

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000
-rw-r--r--   2 hadoop supergroup        223 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fdt
-rw-r--r--   2 hadoop supergroup         12 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fnm
-rw-r--r--   2 hadoop supergroup        991 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.frq
-rw-r--r--   2 hadoop supergroup          9 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.nrm
-rw-r--r--   2 hadoop supergroup       2813 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.prx
-rw-r--r--   2 hadoop supergroup        100 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.tii
-rw-r--r--   2 hadoop supergroup       5169 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.tis
-rw-r--r--   2 hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/index.done
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/segments.gen
-rw-r--r--   2 hadoop supergroup        240 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/segments_2
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001
-rw-r--r--   2 hadoop supergroup        150 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fdt
-rw-r--r--   2 hadoop supergroup         12 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fnm
-rw-r--r--   2 hadoop supergroup        845 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.frq
-rw-r--r--   2 hadoop supergroup          9 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.nrm
-rw-r--r--   2 hadoop supergroup       2109 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.prx
-rw-r--r--   2 hadoop supergroup        106 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.tii
-rw-r--r--   2 hadoop supergroup       6226 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.tis
-rw-r--r--   2 hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/index.done
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/segments.gen
-rw-r--r--   2 hadoop supergroup        240 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/segments_2

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000
-rw-r--r--   2 hadoop supergroup       8131 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000/data
-rw-r--r--   2 hadoop supergroup        215 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001
-rw-r--r--   2 hadoop supergroup      11240 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001/data
-rw-r--r--   2 hadoop supergroup        218 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000
-rw-r--r--   2 hadoop supergroup      13958 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001
-rw-r--r--   2 hadoop supergroup       6908 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000
-rw-r--r--   2 hadoop supergroup        255 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001
-rw-r--r--   2 hadoop supergroup        266 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate
-rw-r--r--   2 hadoop supergroup        255 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate/part-00000
-rw-r--r--   2 hadoop supergroup         86 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate/part-00001

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse
-rw-r--r--   2 hadoop supergroup       6819 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse/part-00000
-rw-r--r--   2 hadoop supergroup       8302 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse/part-00001

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000
-rw-r--r--   2 hadoop supergroup       2995 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001
-rw-r--r--   2 hadoop supergroup       1917 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000
-rw-r--r--   2 hadoop supergroup       3669 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001
-rw-r--r--   2 hadoop supergroup       2770 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001/index
  From the listing above, every directory except the merged index exists in two parts, one per crawler.
  With these two index shards, distributed search can be implemented.
  

  Remaining question: why do none of the step-by-step guides found online use the dedup command?
  From "nutch 数据增量更新" (Nutch incremental data update), distributed crawling should use dedup as well.
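  A hedged sketch of invoking dedup directly on the shard directory from the listing above (Nutch 1.x usage):

```shell
# DeleteDuplicates runs the three MapReduce jobs seen in the job list:
#   1) urls by time  2) content by hash  3) delete from index(es)
bin/nutch dedup crawl/dist/indexes
```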

  See also:
  http://wiki.apache.org/nutch/NutchTutorial
  http://mr-lonely-hp.iyunv.com/blog/1075395
