hao1nan posted on 2019-01-29 09:31:25

Elasticsearch 2.2.0 Tokenization: Chinese Word Segmentation

  Elasticsearch ships with many built-in analyzers, but none of the defaults handle Chinese well, so a separate plugin is needed. The two common choices are smartcn, based on ICTCLAS from the Chinese Academy of Sciences, and IK Analyzer; both give good results. IK Analyzer does not yet support the latest Elasticsearch 2.2.0, while smartcn is officially supported: it provides an analyzer for Chinese or mixed Chinese-English text and already works with 2.2.0. However, smartcn does not support custom dictionaries, so it is mainly useful for a first test. A later section shows how to make IK Analyzer support the latest version.
  

  smartcn
  Install the analysis plugin: plugin install analysis-smartcn

  Uninstall: plugin remove analysis-smartcn
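
  For example, on a single local node (a minimal sketch; /usr/local/elasticsearch stands in for your actual Elasticsearch home directory):

  cd /usr/local/elasticsearch
  # install the smartcn plugin (Elasticsearch 2.x plugin script)
  bin/plugin install analysis-smartcn
  # the plugin is only picked up when the node starts, so start (or restart) it
  bin/elasticsearch -d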

  

  Test:
  Request: POST http://127.0.0.1:9200/_analyze/

  {
    "analyzer": "smartcn",
    "text": "联想是全球最大的笔记本厂商"
  }
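
  From the command line, the same test can be run with curl (assuming a default node listening on 127.0.0.1:9200):

  curl -XPOST 'http://127.0.0.1:9200/_analyze' -d '
  {
    "analyzer": "smartcn",
    "text": "联想是全球最大的笔记本厂商"
  }'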
  Response:
  {
    "tokens": [
      {
        "token": "联想",
        "start_offset": 0,
        "end_offset": 2,
        "type": "word",
        "position": 0
      },
      {
        "token": "是",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 1
      },
      {
        "token": "全球",
        "start_offset": 3,
        "end_offset": 5,
        "type": "word",
        "position": 2
      },
      {
        "token": "最",
        "start_offset": 5,
        "end_offset": 6,
        "type": "word",
        "position": 3
      },
      {
        "token": "大",
        "start_offset": 6,
        "end_offset": 7,
        "type": "word",
        "position": 4
      },
      {
        "token": "的",
        "start_offset": 7,
        "end_offset": 8,
        "type": "word",
        "position": 5
      },
      {
        "token": "笔记本",
        "start_offset": 8,
        "end_offset": 11,
        "type": "word",
        "position": 6
      },
      {
        "token": "厂商",
        "start_offset": 11,
        "end_offset": 13,
        "type": "word",
        "position": 7
      }
    ]
  }
  For comparison, let's look at what the standard analyzer produces. In the request above, replace smartcn with standard.
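  That is, the request body becomes:

  {
    "analyzer": "standard",
    "text": "联想是全球最大的笔记本厂商"
  }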
  The response:

  {
    "tokens": [
      {
        "token": "联",
        "start_offset": 0,
        "end_offset": 1,
        "type": "<IDEOGRAPHIC>",
        "position": 0
      },
      {
        "token": "想",
        "start_offset": 1,
        "end_offset": 2,
        "type": "<IDEOGRAPHIC>",
        "position": 1
      },
      {
        "token": "是",
        "start_offset": 2,
        "end_offset": 3,
        "type": "<IDEOGRAPHIC>",
        "position": 2
      },
      {
        "token": "全",
        "start_offset": 3,
        "end_offset": 4,
        "type": "<IDEOGRAPHIC>",
        "position": 3
      },
      {
        "token": "球",
        "start_offset": 4,
        "end_offset": 5,
        "type": "<IDEOGRAPHIC>",
        "position": 4
      },
      {
        "token": "最",
        "start_offset": 5,
        "end_offset": 6,
        "type": "<IDEOGRAPHIC>",
        "position": 5
      },
      {
        "token": "大",
        "start_offset": 6,
        "end_offset": 7,
        "type": "<IDEOGRAPHIC>",
        "position": 6
      },
      {
        "token": "的",
        "start_offset": 7,
        "end_offset": 8,
        "type": "<IDEOGRAPHIC>",
        "position": 7
      },
      {
        "token": "笔",
        "start_offset": 8,
        "end_offset": 9,
        "type": "<IDEOGRAPHIC>",
        "position": 8
      },
      {
        "token": "记",
        "start_offset": 9,
        "end_offset": 10,
        "type": "<IDEOGRAPHIC>",
        "position": 9
      },
      {
        "token": "本",
        "start_offset": 10,
        "end_offset": 11,
        "type": "<IDEOGRAPHIC>",
        "position": 10
      },
      {
        "token": "厂",
        "start_offset": 11,
        "end_offset": 12,
        "type": "<IDEOGRAPHIC>",
        "position": 11
      },
      {
        "token": "商",
        "start_offset": 12,
        "end_offset": 13,
        "type": "<IDEOGRAPHIC>",
        "position": 12
      }
    ]
  }
  As you can see, this output is basically unusable: every single Chinese character becomes its own token.
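  To actually benefit from smartcn, you would set it as the analyzer on the fields that hold Chinese text in your index mapping. A minimal sketch for Elasticsearch 2.x (test_index, article, and title are made-up names for illustration):

  curl -XPUT 'http://127.0.0.1:9200/test_index' -d '
  {
    "mappings": {
      "article": {
        "properties": {
          "title": { "type": "string", "analyzer": "smartcn" }
        }
      }
    }
  }'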
  This article is an original work by 赛克蓝德 (secisland); when reposting, please credit the author and source.

  

  Making IK Analyzer support 2.2.0
  The latest version on GitHub currently only supports Elasticsearch 2.1.1 (https://github.com/medcl/elasticsearch-analysis-ik), while the latest Elasticsearch is already 2.2.0, so the source needs a small change before it can be used.

  

  1. Download the source and unpack it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number after it to 2.2.0.
  2. Build the code with mvn package (the whole flow is sketched as shell commands after this list).

  3. When the build finishes, elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.

  4. Unzip that file into the Elasticsearch plugins directory.

  5. Add one line to the Elasticsearch configuration file (elasticsearch.yml): index.analysis.analyzer.ik.type : "ik"

  6. Restart Elasticsearch.
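
  A compact sketch of the whole flow (assumptions: the source is fetched with git clone rather than a zip download, ES_HOME stands for your Elasticsearch install directory, and plugins/ik as the target folder follows common practice):

  git clone https://github.com/medcl/elasticsearch-analysis-ik.git
  cd elasticsearch-analysis-ik
  # step 1: edit pom.xml and set the <elasticsearch.version> property to 2.2.0
  mvn package
  # step 4: unpack the built zip into the plugins directory
  mkdir -p $ES_HOME/plugins/ik
  unzip target/releases/elasticsearch-analysis-ik-1.7.0.zip -d $ES_HOME/plugins/ik
  # step 5: add to $ES_HOME/config/elasticsearch.yml:
  #   index.analysis.analyzer.ik.type : "ik"
  # step 6: restart Elasticsearch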

  Test: same request as above, with the analyzer changed to ik.
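  That is, the request body becomes:

  {
    "analyzer": "ik",
    "text": "联想是全球最大的笔记本厂商"
  }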

  Response:

  {
    "tokens": [
      {
        "token": "联想",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
      },
      {
        "token": "全球",
        "start_offset": 3,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 1
      },
      {
        "token": "最大",
        "start_offset": 5,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 2
      },
      {
        "token": "笔记本",
        "start_offset": 8,
        "end_offset": 11,
        "type": "CN_WORD",
        "position": 3
      },
      {
        "token": "笔记",
        "start_offset": 8,
        "end_offset": 10,
        "type": "CN_WORD",
        "position": 4
      },
      {
        "token": "笔",
        "start_offset": 8,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 5
      },
      {
        "token": "记",
        "start_offset": 9,
        "end_offset": 10,
        "type": "CN_CHAR",
        "position": 6
      },
      {
        "token": "本厂",
        "start_offset": 10,
        "end_offset": 12,
        "type": "CN_WORD",
        "position": 7
      },
      {
        "token": "厂商",
        "start_offset": 11,
        "end_offset": 13,
        "type": "CN_WORD",
        "position": 8
      }
    ]
  }
  As you can see, the two analyzers tokenize the text quite differently.
  To extend the dictionary, add the words you need to mydict.dic under config\ik\custom and restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
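  For example, appending the word from a shell (a minimal sketch; the path is relative to the directory that holds the ik config, and it assumes your shell writes UTF-8):

  # append the new word to the custom dictionary
  echo "赛克蓝德" >> config/ik/custom/mydict.dic
  # restart Elasticsearch so the new dictionary entry is loaded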

  For example, after adding the word 赛克蓝德, query again:

  Request: POST http://127.0.0.1:9200/_analyze/

  Body:

  {
    "analyzer": "ik",
    "text": "赛克蓝德是一家数据安全公司"
  }
  Response:
  {
    "tokens": [
      {
        "token": "赛克蓝德",
        "start_offset": 0,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 0
      },
      {
        "token": "克",
        "start_offset": 1,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 1
      },
      {
        "token": "蓝",
        "start_offset": 2,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 2
      },
      {
        "token": "德",
        "start_offset": 3,
        "end_offset": 4,
        "type": "CN_CHAR",
        "position": 3
      },
      {
        "token": "一家",
        "start_offset": 5,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 4
      },
      {
        "token": "一",
        "start_offset": 5,
        "end_offset": 6,
        "type": "TYPE_CNUM",
        "position": 5
      },
      {
        "token": "家",
        "start_offset": 6,
        "end_offset": 7,
        "type": "COUNT",
        "position": 6
      },
      {
        "token": "数据",
        "start_offset": 7,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 7
      },
      {
        "token": "安全",
        "start_offset": 9,
        "end_offset": 11,
        "type": "CN_WORD",
        "position": 8
      },
      {
        "token": "公司",
        "start_offset": 11,
        "end_offset": 13,
        "type": "CN_WORD",
        "position": 9
      }
    ]
  }
  The result above shows that 赛克蓝德 is now recognized as a single word.
  赛克蓝德 (secisland) will continue to analyze the features of the latest Elasticsearch releases; stay tuned. You are also welcome to follow the secisland WeChat public account.

  



