Big Data: Nutch 1.8 + Solr 4 Setup with the IKAnalyzer 2012 Chinese Tokenizer

熬死你的 · Posted 2015-11-12 08:11:07

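Nutch 2.2.1 currently performs worse than Nutch 1.7 (see "NUTCH FIGHT! 1.7 vs 2.2.1"), so for now I am staying on the 1.x line and using Nutch 1.8.

1 Download the precompiled binary package and unpack it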

$ wget http://psg.mtu.edu/pub/apache/nutch/1.8/apache-nutch-1.8-bin.tar.gz
$ tar zxf apache-nutch-1.8-bin.tar.gz
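The tar.gz package can also be downloaded from a mirror such as http://mirrors.cnnic.cn/apache/ and unpacked the same way.

Move the unpacked directory to your install directory; here it goes under /usr as nutch-1.8: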
$ sudo mv apache-nutch-1.8 /usr/nutch-1.8

2 Verify the installation

$ cd /usr/nutch-1.8
$ bin/nutch

If you get "Permission denied", run the following command:

$ chmod +x bin/nutch
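If the nutch usage help is printed, the installation works.

If a warning says that JAVA_HOME is not set, set JAVA_HOME to your JDK directory; this is a JDK environment configuration issue, not a Nutch one.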
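A quick check before running bin/nutch can confirm whether JAVA_HOME is usable. The following is a minimal sketch: check_java_home is a hypothetical helper name, not part of Nutch, and it only tests that $JAVA_HOME/bin/java exists and is executable.

```shell
# Hypothetical helper (not part of Nutch): succeeds when JAVA_HOME
# is set and contains an executable bin/java.
check_java_home() {
  if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME ok: $JAVA_HOME"
  else
    echo "JAVA_HOME is unset or has no bin/java" >&2
    return 1
  fi
}
```

Call check_java_home in the shell you will run Nutch from. If it fails, export JAVA_HOME with the path of your own JDK (the path depends on your system) and try again.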

3 Add seed URLs
In the Nutch directory:

mkdir urls
gedit urls/seed.txt
Add the URLs to crawl, one per line, for example http://www.tianya.cn/
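On a machine without a graphical editor, the seed file can be written from the shell instead. This is only a sketch of the same step; it assumes the current directory is the Nutch directory and overwrites any existing urls/seed.txt.

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls
# Write the seed list, one URL per line; this overwrites urls/seed.txt.
printf '%s\n' 'http://www.tianya.cn/' > urls/seed.txt
# Show what Nutch will read as seeds.
cat urls/seed.txt
```

More seeds can be added by passing further quoted URLs to printf, since the format string is reused for each argument.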

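4 Set URL filter rules

To crawl only certain kinds of URLs, put regular expressions in conf/regex-urlfilter.txt; only URLs that match an accepting rule will be fetched.

For example, to crawl only Douban movie pages, configure it like this: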

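# comment out this line
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# accept anything else
# comment out this line
#+.
+^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$
A leading + accepts URLs that match the regular expression after it, and a leading - rejects them; ^ anchors the match at the start of the URL. A rule consisting only of +^ would therefore accept every URL.

A crawler needs its crawl scope constrained, and almost every crawler does this with regular expressions.

The simplest regular expression is:
http://www.xinhuanet.com/.*
It stands for "http://www.xinhuanet.com/" followed by any number of arbitrary characters (possibly none). This restricts the crawl, but it does not cover every Xinhua page: Xinhua has more than the www.xinhuanet.com domain, with many subdomains such as news.xinhuanet.com.

In that case we need a regular expression like this:
http://([a-z0-9]*\.)*xinhuanet\.com/
This limits the crawl to Xinhua's pages across all of its subdomains.

Each crawler's regex constraint system differs slightly. Here the regex systems of two crawlers, Nutch and WebCollector, are taken for comparison:
Nutch website: http://nutch.apache.org/
WebCollector website: http://crawlscript.github.io/WebCollector/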
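Before starting a crawl it can be useful to try a pattern against a few sample URLs. The sketch below approximates an accepting rule with grep -E. This is an assumption-laden stand-in: grep uses POSIX extended regular expressions while Nutch uses Java regular expressions, so only patterns valid in both dialects (like the two above) behave the same. url_allowed is a hypothetical helper name and the sample URLs are made up for illustration.

```shell
# Hypothetical helper: succeeds when pattern $1 matches somewhere in URL $2.
# grep -E (POSIX ERE) only approximates the Java regex engine Nutch uses.
url_allowed() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# The two patterns discussed above, without the leading "+" of the filter file.
movie='^http://movie\.douban\.com/subject/[0-9]+/(\?.+)?$'
xinhua='^http://([a-z0-9]*\.)*xinhuanet\.com/'

# Illustrative sample URLs; report which ones the movie rule would accept.
for u in 'http://movie.douban.com/subject/1292052/' \
         'http://movie.douban.com/tag/sci-fi'; do
  if url_allowed "$movie" "$u"; then
    echo "accept $u"
  else
    echo "reject $u"
  fi
done
```

The same function works for the Xinhua pattern, for example url_allowed "$xinhua" 'http://news.xinhuanet.com/'.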

5 Set the agent name

conf/nutch-site.xml:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

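This property is documented in nutch-default.xml; you only need to give it a value. In nutch-site.xml the <property> block goes inside the <configuration> element.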

6 Install Solr

Indexing requires Solr, so we need to install and start a Solr server.

See steps 4, 5 and 6 of the Nutch Tutorial, as well as the Solr Tutorial.

6.1 Download and unpack

$ wget http://mirrors.cnnic.cn/apache/lucene/solr/4.8.1/solr-4.8.1.tgz

The package can also be downloaded from http://apache.fayea.com/lucene/solr/

$ tar -zxvf solr-4.8.1.tgz

$ sudo mv solr-4.8.1 /usr/solr4.8.1

6.2 Run Solr

cd /usr/solr4.8.1/example
java -jar start.jar

Check that it started:

Open the following address in a browser:

http://localhost:8983/solr/#/
If the admin page loads, Solr started successfully.
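On a server without a browser, the same check can be done from the command line. The sketch below assumes curl is installed and that Solr listens on the default address used in this post; solr_up is a hypothetical helper name. It queries the core admin STATUS endpoint and treats any successful HTTP response as "up".

```shell
# Hypothetical helper: succeeds when the Solr core admin STATUS endpoint
# answers successfully within 5 seconds. $1 is the Solr base URL.
solr_up() {
  base="${1:-http://localhost:8983/solr}"
  curl -fsS --max-time 5 "$base/admin/cores?action=STATUS" > /dev/null 2>&1
}

if solr_up; then
  echo "Solr is up"
else
  echo "Solr is not reachable"
fi
```

Pass a different base URL as the first argument if Solr runs elsewhere, for instance under Tomcat on another port.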

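6.3 Integrate Nutch with Solr
The Nutch install directory is /usr/nutch-1.8.
The Solr install directory is /usr/solr4.8.1.

Copy conf/schema-solr4.xml from the Nutch directory to example/solr/collection1/conf/ in the Solr directory, rename it to schema.xml, and add the following line at the end of <fields>...</fields> (for an explanation see "Solr 4.2 - what is _version_field?"):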

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>

Restart Solr:

# Ctrl+C to stop Solr
java -jar start.jar

7 Crawl in one step with the crawl script

Nutch ships with a script, ./bin/crawl, that combines the individual crawl steps into one command. Its usage is:

$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

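Note that this is bin/crawl, not bin/nutch crawl; the latter is deprecated.

7.1 Crawl pages

$ ./bin/crawl urls/ ./TestCrawl http://localhost:8983/solr/ 2

urls/ is the directory holding the seed URLs (created in step 3)
TestCrawl is the root directory for the crawl data (in Nutch 2.x it is the crawlId instead, and a table prefixed with the crawlId, such as TestCrawl_Webpage, is created in HBase)
http://localhost:8983/solr/ is the Solr server
2 is numberOfRounds, the number of iterations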

After a while a stream of URLs appears on screen, which shows that the crawler is fetching:

fetching http://music.douban.com/subject/25811077/ (queue crawl delay=5000ms)
fetching http://read.douban.com/ebook/1919781 (queue crawl delay=5000ms)
fetching http://www.douban.com/online/11670861/ (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/绘本 (queue crawl delay=5000ms)
fetching http://movie.douban.com/tag/科幻 (queue crawl delay=5000ms)
49/50 spinwaiting/active, 56 pages, 0 errors, 0.9 1 pages/s, 332 245 kb/s, 131 URLs in 5 queues
fetching http://music.douban.com/subject/25762454/ (queue crawl delay=5000ms)
fetching http://read.douban.com/reader/ebook/1951242/ (queue crawl delay=5000ms)
fetching http://www.douban.com/mobile/read-notes (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/诗歌 (queue crawl delay=5000ms)
50/50 spinwaiting/active, 61 pages, 0 errors, 0.9 1 pages/s, 334 366 kb/s, 127 URLs in 5 queues

7.2 View the results

$ bin/nutch readdb TestCrawl/crawldb/ -stats
14/02/14 16:35:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: TestCrawl/crawldb/
14/02/14 16:35:47 INFO crawl.CrawlDbReader: TOTAL urls:70
14/02/14 16:35:47 INFO crawl.CrawlDbReader: retry 0:70
14/02/14 16:35:47 INFO crawl.CrawlDbReader: min score:0.005
14/02/14 16:35:47 INFO crawl.CrawlDbReader: avg score:0.03877143
14/02/14 16:35:47 INFO crawl.CrawlDbReader: max score:1.23
14/02/14 16:35:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched):59
14/02/14 16:35:47 INFO crawl.CrawlDbReader: status 2 (db_fetched):11
14/02/14 16:35:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
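When only the fetched and unfetched counts matter, the statistics can be condensed with awk. This is a sketch that assumes the output format shown above, where each status line ends with a colon followed by the count; summarize_stats is a hypothetical helper name.

```shell
# Hypothetical helper: reads `readdb -stats` output on stdin and prints
# the db_fetched and db_unfetched counts on one line.
summarize_stats() {
  awk -F: '
    /db_fetched/   { f = $NF }   # last colon-separated field is the count
    /db_unfetched/ { u = $NF }
    END { print "fetched=" f+0, "unfetched=" u+0 }
  '
}
```

A possible invocation is bin/nutch readdb TestCrawl/crawldb/ -stats 2>&1 | summarize_stats, where 2>&1 is there in case the log lines go to standard error on your setup.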

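8 Crawl step by step with individual commands

The previous section used a single command for simplicity. This section follows the crawl steps strictly, one at a time, to show what the crawler actually does. Interested readers can also read the bin/crawl script, where each step is clearly visible.

First delete the data produced in section 7: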

$ rm -rf TestCrawl/

8.1 Basic concepts

Nutch data is composed of:

- The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
- The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
- A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
  - crawl_generate names a set of URLs to be fetched
  - crawl_fetch contains the status of fetching each URL
  - content contains the raw content retrieved from each URL
  - parse_text contains the parsed text of each URL
  - parse_data contains outlinks and metadata parsed from each URL
  - crawl_parse contains the outlink URLs, used to update the crawldb

8.2 inject: build the crawldb from the seed URL list

$ bin/nutch inject TestCrawl/crawldb urls

This builds a URL database from the seed URLs under urls/ and stores it in the crawldb directory.

8.3 generate

$ bin/nutch generate TestCrawl/crawldb TestCrawl/segments

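This generates a fetch list, stored in a directory under segments/ that is named after the current timestamp. We save that directory's name in the shell variable s1: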

$ s1=`ls -d TestCrawl/segments/2* | tail -1`
$ echo $s1

8.4 fetch

$ bin/nutch fetch $s1

This creates two subdirectories under $s1: crawl_fetch and content.

8.5 parse

$ bin/nutch parse $s1

This creates three subdirectories under $s1: crawl_parse, parse_data and parse_text.

8.6 updatedb

$ bin/nutch updatedb TestCrawl/crawldb $s1

This renames crawldb/current to crawldb/old and generates a new crawldb/current.
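The generate, fetch, parse and updatedb steps together form one crawl round, and repeating them lets the crawl go one link deeper each time. The sketch below wraps one round in a function. crawl_round and the NUTCH variable are hypothetical names introduced here, not something Nutch provides, and the function assumes inject has already been run and that segment names start with 2, as in the ls pattern above.

```shell
# Path to the nutch launcher; override NUTCH to use another install.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: run one generate/fetch/parse/updatedb round
# for the crawl directory given as $1 (for example TestCrawl).
crawl_round() {
  crawl_dir="$1"
  "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
  # The newest segment directory is the one generate just created.
  seg=$(ls -d "$crawl_dir"/segments/2* | tail -1)
  "$NUTCH" fetch "$seg" &&
  "$NUTCH" parse "$seg" &&
  "$NUTCH" updatedb "$crawl_dir/crawldb" "$seg"
}
```

Calling crawl_round TestCrawl twice after the inject step corresponds to two rounds of the bin/crawl script, without the indexing part.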

8.7 View the results

$ bin/nutch readdb TestCrawl/crawldb/ -stats

8.8 invertlinks

Before indexing, we first invert all the links so that the anchor text pointing at each page is available and can be indexed along with the page.

$ bin/nutch invertlinks TestCrawl/linkdb -dir TestCrawl/segments

8.9 solrindex: submit the data to Solr and build the index

$ bin/nutch solrindex http://localhost:8983/solr TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ $s1 -filter -normalize

8.10 solrdedup: deduplicate the index

Data is sometimes added more than once, leaving duplicates in the index that need to be removed:

$ bin/nutch solrdedup http://localhost:8983/solr

8.11 solrclean: delete from the index

Data that has gone stale can also be deleted from the index:

$ bin/nutch solrclean TestCrawl/crawldb/ http://localhost:8983/solr

9 Integrate Solr with Tomcat
9.1 Download the Tomcat package from http://tomcat.apache.org/download-70.cgi
$ tar -zxvf apache-tomcat-7.0.57.tar.gz
$ sudo mv apache-tomcat-7.0.57 /usr/tomcat
My install directory here is /usr/tomcat.
9.2 Integrate Solr with Tomcat
Assume $SOLR_HOME is /usr/tomcat/solr.

Step 1: copy solr-4.8.1.war from solr-4.8.1/dist to /usr/tomcat/webapps and rename it to solr.war.

Step 2: copy solr-4.8.1/example/solr to /usr/tomcat, which gives /usr/tomcat/solr.

Step 3: create solr.xml under tomcat/conf/Catalina/localhost with the following content:
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/usr/tomcat/webapps/solr.war" reloadable="true">
  <Environment name="solr/home" type="java.lang.String" value="/usr/tomcat/solr" override="true" />
</Context>

Step 4: copy all the jars from solr-4.8.1/example/lib/ext to tomcat/lib, and copy solr-4.8.1/example/resources/log4j.properties to tomcat/lib as well (for logging details see http://wiki.apache.org/solr/SolrLogging). solr-4.8.1.war does not bundle a logging implementation, so skipping this step can cause the exception "org.apache.catalina.core.StandardContext filterStart SEVERE: Exception starting filter SolrRequestFilter org.apache.solr.common.SolrException: Could not find necessary SLF4j logging jars."

Step 5: edit solrconfig.xml in /usr/tomcat/solr/collection1/conf/ in two places. First, comment out the following block; this simple project does not need these libraries:

<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />

Second, configure a directory for the index data, here /usr/tomcat/solrindex (create it if it does not exist). Change

<dataDir>${solr.data.dir:}</dataDir>

to

<dataDir>${solr.data.dir:/usr/tomcat/solrindex}</dataDir>

Step 6: edit web.xml under /usr/tomcat/webapps/solr/WEB-INF. The correct setting is:

<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>/usr/tomcat/solr</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>

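10 Configure IK

a. Download IKAnalyzer 2012 from http://code.google.com/p/ik-analyzer/downloads/list
This example uses IK Analyzer 2012-FF hotfix 1.
This release works with Solr 4.0; other releases may be incompatible.
b. After downloading, unzip the archive and copy the jar file to /usr/solr4.8.1/example/solr-webapp/webapp/WEB-INF/lib. Create a new directory named classes under /usr/solr4.8.1/example/solr-webapp/webapp/WEB-INF/ and copy stopword.dic and IKAnalyzer.cfg.xml into it. Additional extension dictionaries can be configured in that XML file. (If Solr runs under Tomcat as in section 9, the corresponding directory is /usr/tomcat/webapps/solr/WEB-INF/.)
c. Edit schema.xml at /usr/solr4.8.1/example/solr/collection1/conf/schema.xml.
Among the existing fieldType definitions, add one named text_ik that uses the IK analyzer. (The XML snippet is missing from this copy of the post; such a definition normally sets the analyzer class to org.wltea.analyzer.lucene.IKAnalyzer.)

To have the name field tokenized at index time, search schema.xml for the name field and change its type attribute to the fieldType just defined, text_ik. (The before and after field definitions are likewise missing from this copy.)
When the index is built, the name field is then tokenized by IK.
d. Restart Solr.
e. Check the results.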

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
