Integrating nutch+solr and running it on hadoop

nbvf · posted 2015-7-18 14:05:24

Official documentation: nutch+hadoop

nutch+solr

Versions:

nutch: nutch1.6

solr: solr3.6.2

For background, see: hadoop1.0.4+nutch1.6 "single-machine" configuration

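1. Build nutch with ant

Download apache-nutch-1.6-src.tar.gz and extract it.

In nutch1.6/conf:

First edit http.agent.name and http.robots.agents in nutch-default.xml. The value is up to you, but the two properties must be consistent with each other.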

　　

　　

<property>
  <name>http.agent.name</name>
  <value>sleeper_qp</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>sleeper_qp</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

　　

Then:

　　

cp nutch-default.xml nutch-site.xml
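
Copying the whole file works, but nutch-site.xml only needs the properties that differ from nutch-default.xml, because values in the site file override the defaults. A minimal sketch of that alternative, assuming it is run from nutch1.6/conf. The `CONF_DIR` and `AGENT` variables are illustrative names, not part of Nutch, and the robots list follows the property description's advice to keep `*` at the end:

```shell
# Assumption: run from nutch1.6/conf, or point CONF_DIR (hypothetical variable) at it.
CONF_DIR="${CONF_DIR:-.}"
AGENT="sleeper_qp"   # same agent name as in the post

# Note: this overwrites any existing nutch-site.xml in CONF_DIR.
cat > "$CONF_DIR/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$AGENT</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>$AGENT,*</value>
  </property>
</configuration>
EOF
```

Either way, the agent name has to come first in http.robots.agents.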

Then run ant in nutch1.6 to build it.

2. Test nutch

Start hadoop, then create urls.txt and upload it:

~/hadoop-1.0.4$ bin/start-all.sh
~/hadoop-1.0.4$ touch urls.txt
# add the sites you want to crawl to urls.txt, one URL per line
~/hadoop-1.0.4$ bin/hadoop fs -mkdir urls
~/hadoop-1.0.4$ bin/hadoop fs -put urls.txt urls/
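
The seed file is plain text with one URL per line. A small sketch of filling it and sanity-checking it before the upload; the URL below is only a placeholder, and the check (every non-empty line starts with http:// or https://) is a convenience assumed here, not something Nutch requires you to run:

```shell
# Example seed list; replace the placeholder URL with the sites you want to crawl.
cat > urls.txt <<'EOF'
http://nutch.apache.org/
EOF

# Count non-empty lines that do not start with an http(s) scheme.
bad_lines=$(awk 'NF && $0 !~ /^https?:\/\// {n++} END {print n+0}' urls.txt)
echo "lines without an http(s) scheme: $bad_lines"
```

If the count is not zero, fix those lines before running bin/hadoop fs -put.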

　　

　　

Add the hadoop environment variables

Edit ~/.bashrc:

export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Note: running hadoop commands directly may print a warning, because hadoop also sets its own path (in HADOOP_HOME/bin/hadoop-config.sh).
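
On Hadoop 1.x the warning in question is usually `Warning: $HADOOP_HOME is deprecated.`, which hadoop-config.sh prints whenever HADOOP_HOME is set. A sketch, assuming that is the warning you see: the script skips the message when `HADOOP_HOME_WARN_SUPPRESS` is non-empty, so it can be exported next to the other variables in ~/.bashrc.

```shell
# Same install path as in the post; adjust to your own layout.
export HADOOP_HOME=/home/hadoop/hadoop
export PATH="$PATH:$HADOOP_HOME/bin"
# Any non-empty value silences the "$HADOOP_HOME is deprecated" warning in Hadoop 1.x.
export HADOOP_HOME_WARN_SUPPRESS=1

# Quick check that the hadoop launcher is now on PATH.
if command -v hadoop >/dev/null 2>&1; then
  echo "hadoop found at: $(command -v hadoop)"
else
  echo "hadoop not found; check HADOOP_HOME"
fi
```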

　　

In nutch/runtime/deploy, run:

~/nutch1.6/runtime/deploy$ bin/nutch crawl urls -dir crawl -depth 3 -topN 3
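Explanation:
-dir is the directory where the crawl output is stored; -depth is the crawl depth; -topN is the maximum number of pages fetched at each level.
When it finishes, check HDFS:
run hadoop fs -ls to see the new crawl directory.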
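
To look at the result in a bit more detail, here is a sketch that assumes the crawl above has finished and that it is run from runtime/deploy with `hadoop` on PATH. A Nutch 1.x crawl directory normally contains crawldb, linkdb and segments, and `readdb -stats` summarises the crawl database. The function name `inspect_crawl` is only illustrative.

```shell
# Hypothetical helper: list a crawl directory in HDFS and print crawldb statistics.
inspect_crawl() {
  crawl_dir="${1:-crawl}"
  if ! command -v hadoop >/dev/null 2>&1; then
    echo "hadoop not on PATH; skipping inspection of $crawl_dir"
    return 0
  fi
  # Expect crawldb, linkdb and segments in the listing.
  hadoop fs -ls "$crawl_dir"
  # bin/nutch only resolves when run from nutch1.6/runtime/deploy.
  if [ -x bin/nutch ]; then
    bin/nutch readdb "$crawl_dir/crawldb" -stats
  fi
}

inspect_crawl crawl
```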

　　

　　

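3. Install solr

Download and extract solr3.6.2.

Edit schema.xml under NUTCH_HOME/conf.

Copy schema.xml from NUTCH_HOME/conf to solr/example/solr/conf/.

Then, in solrconfig.xml under solr/example/solr/conf/, change every text that follows str name="df" to content. PS: because of a version change, the default field changed from text to content.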
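
The df edit can be scripted. A minimal sketch, assuming the entries look exactly like `<str name="df">text</str>`; the function name `set_default_field` is our own, not a Solr tool. It keeps a .bak copy of the original file.

```shell
# Hypothetical helper: point every <str name="df"> entry at the content field.
set_default_field() {
  conf_file="$1"
  # -i.bak edits in place and keeps a backup; this form works with GNU and BSD sed.
  sed -i.bak 's|<str name="df">text</str>|<str name="df">content</str>|g' "$conf_file"
}

# Path as given in the post; the edit is skipped if the file is not there.
SOLR_CONF="solr/example/solr/conf/solrconfig.xml"
if [ -f "$SOLR_CONF" ]; then
  set_default_field "$SOLR_CONF"
  echo "df entries now set to content:"
  grep -c '<str name="df">content</str>' "$SOLR_CONF" || true
fi
```

Only exact df entries are rewritten, so other occurrences of text in the file are left alone.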

In {APACHE_SOLR_HOME}/example, run:

　　    java -jar start.jar

　　

4. Integration test

Make sure everything above is correct, restart hadoop (delete the earlier crawl directory in HDFS first), and restart solr.
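
For the cleanup step, a sketch assuming Hadoop 1.x, where recursive delete is `hadoop fs -rmr`, and assuming the earlier output sits in the crawl directory of your HDFS home. `remove_old_crawl` is an illustrative name.

```shell
# Hypothetical helper: remove the previous crawl output from HDFS.
remove_old_crawl() {
  crawl_dir="${1:-crawl}"
  if ! command -v hadoop >/dev/null 2>&1; then
    echo "hadoop not on PATH; nothing removed"
    return 0
  fi
  # -rmr is the Hadoop 1.x recursive delete; report instead of failing if the path is absent.
  hadoop fs -rmr "$crawl_dir" || echo "could not remove $crawl_dir (it may not exist)"
}

remove_old_crawl crawl
```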

Check the status pages in a browser:

　　http://localhost:8983/solr/

　　http://localhost:50070

In ~/nutch1.6/runtime/deploy, run:

bin/nutch crawl urls -solr http://localhost:8983/solr -dir crawl -depth 1 -topN 1

If everything ran correctly, you can go to http://localhost:8983/solr/admin/, search for content from the sites you crawled, and get a result in XML format.
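
The same check can be made from the command line. A sketch assuming Solr 3.6's default /select handler and the content field from the Nutch schema. The search term hadoop and the helper name `solr_query_url` are placeholders, and the term must already be URL-safe.

```shell
SOLR_URL="http://localhost:8983/solr"

# Hypothetical helper: build a select URL that searches the content field.
solr_query_url() {
  printf '%s/select?q=content:%s&wt=xml&rows=5\n' "$SOLR_URL" "$1"
}

url=$(solr_query_url hadoop)
echo "$url"

# Only attempt the request if curl is installed; report instead of failing if Solr is down.
if command -v curl >/dev/null 2>&1; then
  curl -s --max-time 5 "$url" || echo "Solr not reachable at $SOLR_URL"
fi
```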

　　

　　

