Building a search engine environment with hadoop+nutch+solr <Part 2>: integrating nutch+solr and deploying on hadoop

bo13608711 · Posted on 2015-7-12 07:00:27

Official documentation: nutch+hadoop
nutch+solr
Versions:
nutch: nutch1.6
solr: solr3.6.2
See also the hadoop1.0.4+nutch1.6 "single-node" configuration

1. Build nutch with ant
Download apache-nutch-1.6-src.tar.gz and extract it.

In nutch1.6/conf,
first edit http.agent.name and http.robots.agents in nutch-default.xml. The values are arbitrary, but the two must match:
　　
　　

<property>
  <name>http.agent.name</name>
  <value>sleeper_qp</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>sleeper_qp</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
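Because the two values have to agree, a quick check can catch a mismatch before building. The following is only a minimal sketch and not part of the original setup: get_prop and agents_consistent are hypothetical helper names, and the parsing assumes each <name> and <value> element sits on its own line, as in the stock nutch-default.xml.

```shell
# Hypothetical helper: print the <value> of a named property in a
# Hadoop-style XML config file. Assumes <name> and <value> are on
# separate lines, with the value following the name.
get_prop() {
  local name="$1" file="$2"
  awk -v n="$name" '
    index($0, "<name>" n "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
    }
  ' "$file"
}

# Succeeds when the first entry of http.robots.agents equals http.agent.name.
agents_consistent() {
  local file="$1" agent robots
  agent="$(get_prop http.agent.name "$file")"
  robots="$(get_prop http.robots.agents "$file")"
  # ${robots%%,*} keeps only the text before the first comma.
  [ -n "$agent" ] && [ "${robots%%,*}" = "$agent" ]
}
```

For example, agents_consistent conf/nutch-default.xml && echo "agent names match" could be run from the nutch1.6 directory.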

　　
Then
　　

cp nutch-default.xml nutch-site.xml
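Run ant in the nutch1.6 directory to build; the build produces the runtime/deploy directory used in the steps below.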
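Before moving on, it can help to confirm that the build actually produced the deploy runtime. This is a minimal sketch under assumptions: check_deploy_runtime is a hypothetical helper, and the layout it checks (runtime/deploy/bin/nutch plus a *.job archive) is what a Nutch 1.x build is expected to leave behind, so adjust it if your tree differs.

```shell
# Hypothetical helper: verify that the ant build produced the deploy runtime.
# Assumes the Nutch 1.x layout: runtime/deploy/bin/nutch and a *.job archive.
check_deploy_runtime() {
  local nutch_home="$1"
  local deploy="$nutch_home/runtime/deploy"
  if [ ! -f "$deploy/bin/nutch" ]; then
    echo "missing: $deploy/bin/nutch" >&2
    return 1
  fi
  # The job archive is what gets submitted to hadoop.
  if ! ls "$deploy"/*.job >/dev/null 2>&1; then
    echo "missing: job archive in $deploy" >&2
    return 1
  fi
  echo "deploy runtime looks complete: $deploy"
}
```

Usage: check_deploy_runtime ~/nutch1.6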
2. Test nutch
Start hadoop, then create urls.txt and upload it:

~/hadoop-1.0.4$ bin/start-all.sh
~/hadoop-1.0.4$ touch urls.txt
Write the sites you want to crawl into urls.txt, one URL per line
~/hadoop-1.0.4$ bin/hadoop fs -mkdir urls
~/hadoop-1.0.4$ bin/hadoop fs -put urls.txt urls/
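The seed file is plain text with one URL per line. As a minimal sketch of preparing it from the shell (write_seeds and check_seeds are hypothetical helpers, and the URL in the usage line is only a placeholder):

```shell
# Hypothetical helper: write each argument after the file name as one seed line.
write_seeds() {
  local file="$1"
  shift
  printf '%s\n' "$@" > "$file"
}

# Hypothetical helper: fail if any non-blank line is not an http(s) URL.
check_seeds() {
  local file="$1"
  # First grep drops blank lines; second grep -v selects lines that
  # do NOT start with http:// or https://.
  if grep -v '^[[:space:]]*$' "$file" | grep -qvE '^https?://'; then
    echo "non-http(s) line found in $file" >&2
    return 1
  fi
}
```

Usage: write_seeds urls.txt "http://nutch.apache.org/" && check_seeds urls.txt, followed by the hadoop fs -put step above.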
　　
　　
Add the hadoop environment variables
Edit ~/.bashrc:

export HADOOPHOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOPHOME/bin
Note: running the hadoop command directly may print a warning. This is because hadoop also sets its own path (in HADOOP_HOME/bin/hadoop-config.sh).
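If ~/.bashrc is sourced more than once, the PATH line above appends the same entry repeatedly. The sketch below guards against that. It also sets HADOOP_HOME_WARN_SUPPRESS, which Hadoop 1.x's hadoop-config.sh checks before printing its "$HADOOP_HOME is deprecated" message; whether that is the warning you see is an assumption here, so treat that line as optional.

```shell
# Install location taken from the export lines above; override if yours differs.
HADOOPHOME="${HADOOPHOME:-/home/hadoop/hadoop}"

# Append $HADOOPHOME/bin to PATH only if it is not already there.
case ":$PATH:" in
  *":$HADOOPHOME/bin:"*) ;;
  *) PATH="$PATH:$HADOOPHOME/bin" ;;
esac
export HADOOPHOME PATH

# Assumption: the warning is Hadoop 1.x's "$HADOOP_HOME is deprecated" notice.
export HADOOP_HOME_WARN_SUPPRESS=1
```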
　　
In nutch/runtime/deploy, run:

~/nutch1.6/runtime/deploy$ bin/nutch crawl urls -dir crawl -depth 3 -topN 3
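Explanation:
-dir is the directory where the crawl output is stored; -depth is the crawl depth (the number of fetch rounds); -topN is the maximum number of pages fetched in each round.
When it finishes, check HDFS:
run hadoop fs -ls to see the new crawl directory.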
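Since each of the -depth rounds fetches at most -topN pages, the two options together bound the size of a run. A minimal sketch of that arithmetic (max_fetched_pages and describe_crawl are hypothetical helpers that only print text and do not call nutch):

```shell
# Hypothetical helper: upper bound on pages fetched by one crawl run.
# Each of the `depth` rounds fetches at most `topn` pages.
max_fetched_pages() {
  local depth="$1" topn="$2"
  echo $(( depth * topn ))
}

# Print the crawl command and its fetch budget without running anything.
describe_crawl() {
  local seeds="$1" out="$2" depth="$3" topn="$4"
  echo "bin/nutch crawl $seeds -dir $out -depth $depth -topN $topn"
  echo "fetches at most $(max_fetched_pages "$depth" "$topn") pages"
}
```

With the values used above, describe_crawl urls crawl 3 3 reports a budget of at most 9 pages, which is why this test run finishes quickly.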
　　
　　
3. Install solr
Download and extract solr3.6.2
Modify schema.xml under NUTCH_HOME/conf

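Copy schema.xml from NUTCH_HOME/conf to solr/example/solr/conf/
Then, in solrconfig.xml under solr/example/solr/conf/, change every text that follows str name="df" to content. PS: because of a version change, the default field was changed from text to content.
In {APACHE_SOLR_HOME}/example, run:
java -jar start.jar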
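The df change in solrconfig.xml can also be scripted instead of edited by hand. This is a minimal sketch, assuming the entries appear exactly as <str name="df">text</str> with no extra whitespace; set_default_field is a hypothetical helper name, and the original file is kept as a .bak copy.

```shell
# Hypothetical helper: switch the default search field from text to content.
# sed -i.bak edits in place and leaves the original as <file>.bak.
set_default_field() {
  local file="$1"
  sed -i.bak 's|<str name="df">text</str>|<str name="df">content</str>|g' "$file"
}
```

Usage: set_default_field solr/example/solr/conf/solrconfig.xml, then restart solr so the change takes effect.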
　　
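4. Integration test
Make sure everything above is correct, then restart hadoop (delete the crawl directory created earlier in HDFS) and restart solr
Check the status pages in a browser: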
　　http://localhost:8983/solr/
　　http://localhost:50070
In ~/nutch1.6/runtime/deploy, run:

bin/nutch crawl urls -solr http://localhost:8983/solr -dir crawl -depth 1 -topN 1
If everything ran correctly, you can enter a term from the site you crawled at http://localhost:8983/solr/admin/ and get the result back in XML format
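The same check can be made from the command line through Solr's select handler. This is a sketch under assumptions: solr_select_url and solr_search are hypothetical helpers, the query targets the content field from the Nutch schema, and the first helper does no URL encoding, so it only suits single-word terms.

```shell
# Hypothetical helper: build a select URL for a single-word search term.
# No URL encoding is done here.
solr_select_url() {
  local base="$1" term="$2"
  echo "$base/select?q=content:$term"
}

# Hypothetical helper: run the query with curl.
# -G sends the data as a query string; --data-urlencode encodes the term.
solr_search() {
  local base="$1" term="$2"
  curl -sG "$base/select" --data-urlencode "q=content:$term"
}
```

Usage: solr_search http://localhost:8983/solr hadoop prints the XML response, and solr_select_url gives a URL that can be pasted into a browser.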
　　
　　
