Installing the ik plugin for Elasticsearch

Posted by 于一 on 2017-5-20 13:18:06

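　　I wanted to install a Chinese word-segmentation plugin for elasticsearch, but the material online is somewhat out of date.
　　So here is a record of how I installed the ik plugin from source.
　　(Note: the elasticsearch version I used is 0.90.2.)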
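　　1. Download the source
　　First, download the source from ik's GitHub page: https://github.com/medcl/elasticsearch-analysis-ik
　　After downloading it, I found that the source does not come with a prebuilt jar, so I built one with mvn package.
　　The resulting jar is named elasticsearch-analysis-ik-1.2.2.jar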
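　　The download-and-build step can be sketched as a small shell function. This is only an illustration: it assumes git and mvn are on the PATH, that the revision you check out matches your elasticsearch version, and that Maven writes the jar to its default target/ directory. The function name build_ik and its parameter are made up for this sketch.

```shell
#!/usr/bin/env bash
# Sketch: fetch the ik source and build the plugin jar with Maven.
# Assumptions: git and mvn are installed; the jar lands in target/ (Maven's default).

build_ik() {
    local dest="$1"    # directory to clone into (hypothetical parameter)
    git clone https://github.com/medcl/elasticsearch-analysis-ik "$dest" || return 1
    # Build in a subshell so the caller's working directory is unchanged.
    ( cd "$dest" && mvn package ) || return 1
    # Print the path of the jar that was built.
    ls "$dest"/target/elasticsearch-analysis-ik-*.jar
}

# Usage: build_ik ./elasticsearch-analysis-ik
```

　　Printing the jar path at the end makes it easy to pass the result on to the copy step.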
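　　2. Copy the files
　　This step is simple: copy the jar into the ES_HOME/plugins/analysis-ik directory.
　　Then copy the contents of the config/ik directory into ES_HOME/config/ik (my local machine runs Windows and es runs on Linux, so I zipped the folder first and unzipped it on the server).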
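　　On the Linux side, the two copies could look roughly like the sketch below. The helper name install_ik and its three parameters are hypothetical, and the sketch assumes the plugin directory under ES_HOME is called plugins.

```shell
#!/usr/bin/env bash
# Sketch: copy the built jar and the ik config into an Elasticsearch install.
# install_ik and its three parameters are hypothetical names for illustration.

install_ik() {
    local jar="$1"        # path to elasticsearch-analysis-ik-*.jar
    local ik_config="$2"  # the config/ik directory from the source tree
    local es_home="$3"    # Elasticsearch installation directory

    # Create the target directories if they do not exist yet.
    mkdir -p "$es_home/plugins/analysis-ik" "$es_home/config/ik" || return 1
    cp "$jar" "$es_home/plugins/analysis-ik/" || return 1
    # "/." copies the directory's contents rather than the directory itself.
    cp -r "$ik_config/." "$es_home/config/ik/"
}

# Usage: install_ik target/elasticsearch-analysis-ik-1.2.2.jar config/ik "$ES_HOME"
```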
　　3. Add the configuration
　　Edit elasticsearch.yml and append the following to the end of the file:

index:
  analysis:
    analyzer:
      ik:
        alias:
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true
　　Then restart elasticsearch.
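　　If you would rather script the configuration step, a sketch like the following appends the settings only once. The helper name append_ik_config is hypothetical, and the YAML nesting is an assumption reconstructed from the key order above. If your elasticsearch.yml already has a top-level index: key, merge the settings by hand instead of appending.

```shell
#!/usr/bin/env bash
# Sketch: append the ik analyzer settings to elasticsearch.yml exactly once.
# append_ik_config is a hypothetical helper; the YAML nesting below is an
# assumption reconstructed from the key order in the article.

append_ik_config() {
    local yml="$1"   # path to elasticsearch.yml
    # Skip if the provider class is already mentioned, so reruns are harmless.
    if grep -q 'IkAnalyzerProvider' "$yml" 2>/dev/null; then
        echo "ik settings already present in $yml"
        return 0
    fi
    cat >> "$yml" <<'EOF'
index:
  analysis:
    analyzer:
      ik:
        alias:
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true
EOF
}

# Usage: append_ik_config "$ES_HOME/config/elasticsearch.yml"
```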
　　4. Test the analyzer plugin
　　For reasons I have not figured out, the following command did not work for testing:

curl 'http://localhost:9200/_analyze?analyzer=ik&pretty=true' -d'
{
"text":"去北京怎么走"
}
'
　　Judging from the es log, however, the plugin does appear to have been loaded.
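　　One way to make that log check repeatable is a small grep wrapper. This sketch assumes the startup log contains a plugins line of the form "loaded [analysis-ik], sites []"; both that format and the helper name plugin_loaded are assumptions, so adjust the pattern to whatever your log actually prints.

```shell
#!/usr/bin/env bash
# Sketch: check whether the Elasticsearch log reports the ik plugin as loaded.
# Assumption: the startup log has a plugins line of the form
#   ... loaded [analysis-ik], sites []
# plugin_loaded is a hypothetical helper name.

plugin_loaded() {
    local log="$1"                    # path to the Elasticsearch log file
    local name="${2:-analysis-ik}"    # plugin directory name to look for
    # Match "loaded [ ... name" with no closing bracket in between.
    grep -q "loaded \[[^]]*${name}" "$log"
}

# Usage (the log path is an assumption):
#   plugin_loaded "$ES_HOME/logs/elasticsearch.log" && echo "ik is loaded"
```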
　　Following the ik plugin's instructions, I created an index; run against that index, the query above works:

curl -XPUT http://localhost:9200/index
curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "indexAnalyzer": "ik",
            "searchAnalyzer": "ik",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "indexAnalyzer": "ik",
                "searchAnalyzer": "ik",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'
# Test command
curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d'
{
"text":"去北京怎么走"
}
'
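　　The analysis result is shown below. Note the extra "text" token: judging from the offsets, the whole request body, JSON key included, was analyzed as plain text.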

{
"tokens" : [ {
"token" : "text",
"start_offset" : 4,
"end_offset" : 8,
"type" : "ENGLISH",
"position" : 1
}, {
"token" : "去",
"start_offset" : 11,
"end_offset" : 12,
"type" : "CN_CHAR",
"position" : 2
}, {
"token" : "北京",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "怎么走",
"start_offset" : 14,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 4
} ]
}
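　　Responses like this get long quickly, so it can help to print only the token strings. The sketch below relies on the pretty-printed layout shown above ("token" : "..."); extract_tokens is a hypothetical helper, not part of elasticsearch or ik.

```shell
#!/usr/bin/env bash
# Sketch: list just the token strings from an _analyze response read on stdin.
# Relies on the pretty-printed layout shown in the article ("token" : "...");
# extract_tokens is a hypothetical helper, not part of Elasticsearch or ik.

extract_tokens() {
    # Keep only the "token" : "..." fragments, then strip the key and quotes.
    grep -o '"token" : "[^"]*"' | sed 's/^"token" : "\(.*\)"$/\1/'
}

# Usage:
#   curl -s 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' \
#        -d '去北京怎么走' | extract_tokens
```

　　Because this is plain text matching rather than JSON parsing, it only works while pretty=true keeps one field per line.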
　　5. Addendum
　　When I analyzed "中华人民共和国", I was surprised to find that it was not split at all:

curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d'
> {
> "text":"中华人民共和国"
> }
> '
{
"tokens" : [ {
"token" : "text",
"start_offset" : 12,
"end_offset" : 16,
"type" : "ENGLISH",
"position" : 1
}, {
"token" : "中华人民共和国",
"start_offset" : 19,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 2
} ]
}
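　　This is not the result we want. Is ik really so poor that it cannot segment this? After some digging, I found that ik has a smart mode, and that this mode is the default. In smart mode, a search for "中华人民共和国" may fail to find documents that contain only "共和国". Switching to ik_max_word mode fixes this. I am still exploring analyzers.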

curl 'http://localhost:9200/index/_analyze?analyzer=ik_max_word&pretty=true' -d'
> {
> "text":"中华人民共和国"
> }
> '
{
"tokens" : [ {
"token" : "text",
"start_offset" : 12,
"end_offset" : 16,
"type" : "ENGLISH",
"position" : 1
}, {
"token" : "中华人民共和国",
"start_offset" : 19,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "中华人民",
"start_offset" : 19,
"end_offset" : 23,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "中华",
"start_offset" : 19,
"end_offset" : 21,
"type" : "CN_WORD",
"position" : 4
}, {
"token" : "华人",
"start_offset" : 20,
"end_offset" : 22,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "人民共和国",
"start_offset" : 21,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "人民",
"start_offset" : 21,
"end_offset" : 23,
"type" : "CN_WORD",
"position" : 7
}, {
"token" : "共和国",
"start_offset" : 23,
"end_offset" : 26,
"type" : "CN_WORD",
"position" : 8
}, {
"token" : "共和",
"start_offset" : 23,
"end_offset" : 25,
"type" : "CN_WORD",
"position" : 9
}, {
"token" : "国",
"start_offset" : 25,
"end_offset" : 26,
"type" : "CN_CHAR",
"position" : 10
} ]
}
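　　To compare the two modes side by side, the same text can be sent to each analyzer in a loop. This is a sketch under assumptions: compare_analyzers and its parameters are hypothetical names, the analyzer names must exist in your configuration, and the text is sent as a raw body because the responses above show the request body being analyzed as plain text.

```shell
#!/usr/bin/env bash
# Sketch: run the same text through several analyzers for side-by-side comparison.
# compare_analyzers, es_url and index are hypothetical names; the analyzer
# names must exist in your configuration (here: ik and ik_max_word).

compare_analyzers() {
    local es_url="$1" index="$2" text="$3"
    shift 3
    local analyzer
    for analyzer in "$@"; do
        echo "== $analyzer =="
        # The body is sent as raw text, matching how the responses in the
        # article treat the request body (the JSON key itself got tokenized).
        curl -s "$es_url/$index/_analyze?analyzer=$analyzer&pretty=true" -d "$text"
        echo
    done
}

# Usage: compare_analyzers http://localhost:9200 index '中华人民共和国' ik ik_max_word
```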
　　Please support the original post:

http://donlianli.iteye.com/blog/1948841

Interested in topics like this? Feel free to email donlianli@126.com

About me: from Handan; skilled in Java, Javascript, Extjs, and oracle sql.

For more of my earlier articles, visit my space
