lilingjie2015 posted on 2019-1-29 11:42:17

hanlp for elasticsearch (a HanLP-based word segmentation plugin for ES)

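  Abstract: Elasticsearch is a widely used distributed search engine. ES ships with a tokenizer that splits Chinese text into single characters, and the IK analysis plugin is also widely used. HanLP is a natural language processing package that segments text more accurately by using the semantic context and by recognizing person names, place names, organization names, and so on.
  elasticsearch-analysis-hanlp plugin: https://github.com/pengcong90/elasticsearch-analysis-hanlp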
  Elasticsearch
  Default tokenizer
  [Screenshot: request using the default tokenizer; image not preserved]
  Output:
  [Screenshot: default tokenizer output; image not preserved]
  IK tokenizer
  [Screenshot: request using the IK tokenizer; image not preserved]
  Output:
  [Screenshots: IK tokenizer output, two images; not preserved]
  HanLP tokenizer
  [Screenshot: request using the HanLP tokenizer; image not preserved]
  Output:
  [Screenshots: HanLP tokenizer output, two images; not preserved]
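  IK does not take the meaning of the sentence into account when segmenting, whereas HanLP uses the semantics to split the words correctly.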
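  Installation steps:
  1. Go to https://github.com/pengcong90/elasticsearch-analysis-hanlp, download the plugin, and unzip it into the plugins directory of ES. Then edit the hanlp.properties file in the analysis-hanlp directory and set the root property to the path of the data directory under analysis-hanlp.
  2. Edit the jvm.options file in the ES config directory and add the following as the last line:
  -Djava.security.policy=../plugins/analysis-hanlp/plugin-security.policy
  Restart ES.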
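The two file edits above can also be scripted. The following is only a minimal sketch: it assumes both files are UTF-8 text and that hanlp.properties uses a plain `root=<path>` line; the function names and any paths passed in are hypothetical and not part of the plugin.

```python
from pathlib import Path

# Flag from step 2 of the post; it goes at the end of config/jvm.options.
POLICY_OPTION = "-Djava.security.policy=../plugins/analysis-hanlp/plugin-security.policy"


def set_hanlp_root(properties_file: Path, data_dir: str) -> None:
    """Rewrite the `root` entry in hanlp.properties, or add it if it is missing."""
    updated, found = [], False
    for line in properties_file.read_text(encoding="utf-8").splitlines():
        # Commented lines such as "#root=..." have the key "#root" and are kept as-is.
        if line.split("=", 1)[0].strip() == "root":
            updated.append(f"root={data_dir}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"root={data_dir}")
    properties_file.write_text("\n".join(updated) + "\n", encoding="utf-8")


def add_policy_option(jvm_options: Path) -> None:
    """Append the security-policy flag unless the file already contains it."""
    text = jvm_options.read_text(encoding="utf-8")
    if POLICY_OPTION in text.splitlines():
        return
    if text and not text.endswith("\n"):
        text += "\n"
    jvm_options.write_text(text + POLICY_OPTION + "\n", encoding="utf-8")
```

Calling `set_hanlp_root(Path(".../plugins/analysis-hanlp/hanlp.properties"), "<data path>")` and `add_policy_option(Path(".../config/jvm.options"))` with the paths of your own installation covers both steps; ES still has to be restarted afterwards. Forward slashes are the safer choice in the path, since backslashes act as escape characters in Java properties files.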
  To test whether the installation succeeded, run:
  GET /_analyze?pretty=true
  {
  "analyzer": "hanlp-index",
  "text": "张柏芝士蛋糕店"
  }
  Two analyzers are available: hanlp-index (index mode) and hanlp-smart (smart mode).
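To try both analyzers from a script rather than a console, something like the following may help. It is a sketch under assumptions: the node address `http://localhost:9200` is a guess at a local setup, the helper names are hypothetical, and the analyzer name is sent in the JSON request body of the `_analyze` API.

```python
import json
import urllib.request

# Assumption: a local single-node cluster; change this for your deployment.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer: str, text: str, base_url: str = ES_URL) -> urllib.request.Request:
    """Build a POST request for the _analyze API with analyzer and text in the body."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response: dict) -> list:
    """Pull the token strings out of a parsed _analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


def analyze(analyzer: str, text: str, base_url: str = ES_URL) -> list:
    """Send the request to a running node and return the list of tokens."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text, base_url)) as resp:
        return extract_tokens(json.load(resp))
```

With a node running and the plugin installed, `for name in ("hanlp-index", "hanlp-smart"): print(name, analyze(name, "张柏芝士蛋糕店"))` prints the tokens each mode produces, which makes the difference between the two modes easy to compare side by side.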
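  Custom dictionary
  Edit the file 我的词典.txt under plugins/analysis-hanlp/data/dictionary/custom.
  Each entry follows the format [word] [part of speech A].
  After editing, delete the CustomDictionary.txt.bin file in the same directory.
  Restart the ES service.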
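The dictionary update above could be automated roughly as follows. This is a sketch, not part of the plugin: it assumes the dictionary is UTF-8 text with one space-separated entry per line, and the function name, the example word, and the tag used in the usage note are hypothetical.

```python
from pathlib import Path

DICT_NAME = "我的词典.txt"               # dictionary file named in the post
CACHE_NAME = "CustomDictionary.txt.bin"  # cache file the post says to delete


def add_custom_word(custom_dir: Path, word: str, pos: str) -> bool:
    """Append "word pos" to the custom dictionary and drop the cache file.

    Returns False and changes nothing if the word is already listed.
    """
    dict_file = custom_dir / DICT_NAME
    existing = dict_file.read_text(encoding="utf-8") if dict_file.exists() else ""
    known = {line.split()[0] for line in existing.splitlines() if line.strip()}
    if word in known:
        return False
    if existing and not existing.endswith("\n"):
        existing += "\n"
    dict_file.write_text(existing + f"{word} {pos}\n", encoding="utf-8")
    # The dictionary changed, so remove the cache file as the post instructs.
    cache = custom_dir / CACHE_NAME
    if cache.exists():
        cache.unlink()
    return True
```

For example, `add_custom_word(Path("plugins/analysis-hanlp/data/dictionary/custom"), "芝士蛋糕店", "n")` adds one entry. The ES service still needs a restart before the new word takes effect.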
  Source: pengcong90's blog
