CentOS 6.5+Nutch 1.7+Solr 4.7+IK 2012

trsgw · Posted on 2015-5-7 09:14:11

Environment

Linux version: CentOS 6.5
JDK version: JDK 1.7
Nutch version: Nutch 1.7
Solr version: Solr 4.7
IK version: IK-Analyzer 2012

1. Install the JDK
2. Install Solr
3. Configure the IK analyzer for Solr
4. Install Nutch

Contents

1. Install the JDK

1.1 Create a java/ directory under /usr/, then download and extract the JDK package

[iyunv@localhost ~]# mkdir /usr/java
[iyunv@localhost ~]# cd /usr/java
[iyunv@localhost java]# curl -O http://download.oracle.com/otn-p ... 75-linux-x64.tar.gz
[iyunv@localhost java]# tar -zxvf jdk-7u75-linux-x64.tar.gz

1.2 Set environment variables

[iyunv@localhost java]# vi /etc/profile

Add the following:

#set JDK environment
JAVA_HOME=/usr/java/jdk1.7.0_75
JRE_HOME=$JAVA_HOME/jre
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME CLASSPATH PATH

Apply the changes:

[iyunv@localhost java]# source /etc/profile

1.3 Verify

[iyunv@localhost java]# java -version
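
If java -version fails, it helps to check where JAVA_HOME actually points. The following is a minimal sketch, assuming the variables from /etc/profile have been loaded into the current shell; the helper name is_jdk_home is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given directory looks like a JDK home,
# i.e. it contains an executable bin/java.
is_jdk_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if is_jdk_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks valid: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or has no bin/java: '${JAVA_HOME:-}'"
fi
```

If the second message appears, the usual causes are a typo in /etc/profile or an extracted directory name that differs from the one assigned to JAVA_HOME.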

2. Install Solr

2.1 Create a solr directory under /usr/, then download and extract the Solr package

[iyunv@localhost ~]# mkdir /usr/solr
[iyunv@localhost ~]# cd /usr/solr
[iyunv@localhost solr]# curl -O http://archive.apache.org/dist/lucene/solr/4.7.0/solr-4.7.0.tgz
[iyunv@localhost solr]# tar -zxvf solr-4.7.0.tgz

2.2 Start Jetty

This setup uses the Jetty server bundled with Solr.

[iyunv@localhost solr]# cd solr-4.7.0/example
[iyunv@localhost example]# java -jar start.jar

2.3 Verify

Open the following address in a browser: http://10.192.87.198:8983/solr/#/collection1/query
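
The same check can be done from the command line. This is a rough sketch, assuming the stock example configuration, which registers a ping handler at /admin/ping for collection1; the helper names solr_ping_url and solr_is_up are hypothetical.

```shell
# Hypothetical helper: build the ping URL for a Solr core.
# $1 = host, $2 = port, $3 = core name.
solr_ping_url() {
  printf 'http://%s:%s/solr/%s/admin/ping?wt=json\n' "$1" "$2" "$3"
}

# Hypothetical helper: succeed if the core answers the ping request.
# -f makes curl fail on HTTP errors, -s silences progress output.
solr_is_up() {
  curl -fs "$(solr_ping_url "$1" "$2" "$3")" > /dev/null
}

# Usage (needs a running Solr):
#   solr_is_up 10.192.87.198 8983 collection1 && echo "Solr is up"
```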

3. Configure the IK analyzer for Solr

3.1 Download IK-Analyzer-2012

After extracting the archive, upload the three files IKAnalyzer.cfg.xml, IKAnalyzer2012_FF.jar, and stopword.dic to the /usr/solr/solr-4.7.0/example/solr-webapp/webapp/WEB-INF/lib/ directory.

3.2 Edit the configuration file /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml

[iyunv@localhost solr]# cd /usr/solr/solr-4.7.0/example/solr/collection1/conf/
[iyunv@localhost conf]# vi schema.xml

Add the following inside <types>…</types>:

<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

3.3 Verify

Restart Solr, open http://10.192.87.198:8983/solr/#/collection1/analysis, and run a test to check the tokenization result.
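
The analysis page can also be exercised with curl. The sketch below assumes the field analysis handler is registered at /analysis/field, as in the stock example solrconfig.xml, and that the text_ik type has been loaded; the helper name analyze_text and the sample sentence are made up for illustration.

```shell
# Hypothetical helper: send text to Solr's field analysis handler.
# $1 = core base URL, $2 = field type name, $3 = text to analyze.
# -G sends the --data-urlencode pairs as a GET query string, so Chinese
# text is percent-encoded before it goes on the wire.
analyze_text() {
  curl -s -G "$1/analysis/field" \
    --data-urlencode "analysis.fieldtype=$2" \
    --data-urlencode "analysis.fieldvalue=$3" \
    --data-urlencode "wt=json"
}

# Usage (needs a running Solr):
#   analyze_text http://10.192.87.198:8983/solr/collection1 text_ik "中华人民共和国"
```

If the response lists several multi-character tokens rather than one token per character, the IK analyzer is most likely being used.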

4. Install Nutch

4.1 Create a nutch directory under /usr/, then download and extract the Nutch package

[iyunv@localhost ~]# mkdir /usr/nutch
[iyunv@localhost ~]# cd /usr/nutch
[iyunv@localhost nutch]# curl -O http://archive.apache.org/dist/n ... utch-1.7-bin.tar.gz
[iyunv@localhost nutch]# tar -zxvf apache-nutch-1.7-bin.tar.gz

4.2 Edit the nutch-site.xml configuration file

[iyunv@localhost nutch]# cd apache-nutch-1.7/conf
[iyunv@localhost conf]# vi nutch-site.xml

Add the following properties inside <configuration>..</configuration>:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>

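4.3 Edit regex-urlfilter.txt to set the filter rules

[iyunv@localhost conf]# vi regex-urlfilter.txt

This file uses regular expressions to match the addresses of the sites you want to crawl.
The example below uses a regular expression to restrict the crawler to the sohu.com domain.
Before:

+.

After:

+^http://([a-z0-9]*\.)*sohu.com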
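
Before crawling, the rule can be tried against a few sample URLs. This is only a sketch: it assumes the rule is written as +^http://([a-z0-9]*\.)*sohu.com, where the two * quantifiers allow any number of subdomain labels of any length, and it uses grep -E, whose syntax agrees with Nutch's Java regular expressions for a pattern this simple. The helper name url_accepted and the sample URLs are hypothetical.

```shell
# Pattern taken from the filter rule, without the leading "+"
# (in regex-urlfilter.txt, "+" means accept and "-" means reject).
pattern='^http://([a-z0-9]*\.)*sohu.com'

# Hypothetical helper: succeed if the URL matches the pattern.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in http://www.sohu.com/ http://news.sohu.com/a.html http://www.example.com/; do
  if url_accepted "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```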

4.4 Specify the site to crawl

[iyunv@localhost conf]# cd /usr/nutch/apache-nutch-1.7
[iyunv@localhost apache-nutch-1.7]# mkdir urls
[iyunv@localhost apache-nutch-1.7]# echo "http://www.sohu.com">urls/seed.txt

4.5 Run the crawl

[iyunv@localhost apache-nutch-1.7]# bin/nutch crawl urls -dir crawl -depth 2 -topN 5

Use tree to inspect the /usr/nutch/apache-nutch-1.7/crawl directory:

[iyunv@localhost apache-nutch-1.7]# tree crawl/
crawl/
├── crawldb
│   ├── current
│   │   └── part-00000
│   │       ├── data
│   │       └── index
│   └── old
│       └── part-00000
│           ├── data
│           └── index
├── linkdb
│   └── current
│       └── part-00000
│           ├── data
│           └── index
└── segments
    ├── 20150326234924
    │   ├── content
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   ├── crawl_fetch
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   ├── crawl_generate
    │   │   └── part-00000
    │   ├── crawl_parse
    │   │   └── part-00000
    │   ├── parse_data
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   └── parse_text
    │       └── part-00000
    │           ├── data
    │           └── index
    └── 20150326234933
        ├── content
        │   └── part-00000
        │       ├── data
        │       └── index
        ├── crawl_fetch
        │   └── part-00000
        │       ├── data
        │       └── index
        ├── crawl_generate
        │   └── part-00000
        ├── crawl_parse
        │   └── part-00000
        ├── parse_data
        │   └── part-00000
        │       ├── data
        │       └── index
        └── parse_text
            └── part-00000
                ├── data
                └── index

The crawl has produced data.
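
Each run of the crawl loop writes one timestamped directory under segments, so it is often handy to find the newest one. A minimal sketch, assuming the crawl directory layout shown above; the helper name latest_segment is hypothetical.

```shell
# Hypothetical helper: print the newest segment directory under a crawl directory.
# Segment names are timestamps (yyyyMMddHHmmss), so the lexically largest
# name is also the most recent one.
latest_segment() {
  for seg in "$1"/segments/*/; do
    [ -d "$seg" ] && printf '%s\n' "${seg%/}"
  done | sort | tail -n 1
}

# Usage (run from the Nutch directory after a crawl):
#   bin/nutch readseg -list "$(latest_segment crawl)"
#   bin/nutch readdb crawl/crawldb -stats
```

The two commands in the usage comment print a summary of one segment and overall statistics for the crawldb.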

4.6 Integrate with Solr

Edit /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml and add the following fields inside <fields>…</fields>:

<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>

Restart Solr and run the crawl again.
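
The post does not show how the crawled data reaches Solr. One possible way, sketched here under assumptions, is to push the existing crawl directory with the solrindex command and then ask Solr how many documents it holds. The Solr address http://localhost:8983/solr/ (Solr on the same machine as Nutch) and the helper names index_to_solr and solr_count_query are assumptions, not taken from the original.

```shell
# Path to the nutch launcher; override NUTCH when running from another directory.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: index an existing crawl directory into Solr.
# $1 = Solr URL, $2 = crawl directory created by the crawl step.
index_to_solr() {
  "$NUTCH" solrindex "$1" "$2/crawldb" -linkdb "$2/linkdb" -dir "$2/segments"
}

# Hypothetical helper: query Solr for all documents without returning rows;
# the numFound value in the JSON response is the number of indexed documents.
solr_count_query() {
  curl -s "$1/select?q=*:*&rows=0&wt=json"
}

# Usage (needs Solr running and a finished crawl):
#   index_to_solr http://localhost:8983/solr/ crawl
#   solr_count_query http://localhost:8983/solr/collection1
```

The Nutch 1.x crawl command also accepts a -solr option followed by the Solr URL, which crawls and indexes in one pass.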