Solr 4.4 + mmseg4j-1.9.1 Chinese word segmentation

jiay · posted 2016-12-17 07:30:30

1. For the Solr setup itself, see solr4.4.0配置笔记.txt (the Solr 4.4.0 configuration notes).
2. Download mmseg4j-1.9.1 from http://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.zip
mmseg4j 1.8.3 supports only the Lucene 2.9/3.0 API and Solr 1.4; nothing else changed.
mmseg4j 1.8.5 supports Lucene 3.1 and Solr 3.1.
mmseg4j 1.9.0 supports Lucene 4.0 and Solr 4.0.
3. Create two folders, lib and dic, under E:\private_project\solr\solr_home\solr
4. Unzip mmseg4j-1.9.1.zip and copy the 3 jars from mmseg4j-1.9.1\dist into the new lib folder, i.e. E:\private_project\solr\solr_home\solr\lib
5. Unpack mmseg4j-core-1.9.1.jar (found in mmseg4j-1.9.1\dist) and copy its 3 *.dic files into E:\private_project\solr\solr_home\solr\dic
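6. Edit schema.xml under E:\private_project\solr\solr_home\solr\collection1\conf and add the following in the appropriate places (the field and copyField entries next to the existing fields, the fieldType definitions next to the existing types)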
<field name="simple" type="textSimple" indexed="true" stored="true"/>
<field name="complex" type="textComplex" indexed="true" stored="true"/>
<field name="MaxWord" type="textMaxWord" indexed="true" stored="true"/>

<copyField source="simple" dest="text" />
<copyField source="complex" dest="text"/>
<copyField source="MaxWord" dest="text"/>

<fieldType name="textComplex" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="E:/private_project/solr/solr_home/solr/dic"/>
  </analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="E:/private_project/solr/solr_home/solr/dic"/>
  </analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="E:/private_project/solr/solr_home/solr/dic"/>
  </analyzer>
</fieldType>

Note that dicPath must point to the folder where your *.dic files are stored.
7. Edit solrconfig.xml under E:\private_project\solr\solr_home\solr\collection1\conf and add the following in an appropriate place
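Since a jar is an ordinary zip archive, step 5 and the dicPath note can be sketched as a small script. This is only a rough sketch: the function name extract_dic_files is made up for this post, and it assumes the dictionaries can be recognised purely by their .dic extension.

```python
import zipfile
from pathlib import Path


def extract_dic_files(jar_path, dic_dir):
    """Copy every *.dic entry of a jar (a zip archive) into dic_dir.

    Returns the sorted list of file names written.
    """
    dic_dir = Path(dic_dir)
    dic_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(jar_path) as jar:
        for entry in jar.namelist():
            if entry.endswith(".dic"):
                # Drop any directory prefix inside the jar; keep only the file name.
                target = dic_dir / Path(entry).name
                target.write_bytes(jar.read(entry))
                written.append(target.name)
    return sorted(written)
```

With the layout used in this post, the call would presumably be extract_dic_files(r"E:\private_project\solr\solr_home\solr\lib\mmseg4j-core-1.9.1.jar", r"E:\private_project\solr\solr_home\solr\dic"); if the returned list does not contain 3 names, dicPath is probably not pointing at a complete dictionary folder.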
<lib dir="E:/private_project/solr/solr_home/solr/lib" regex=".*\.jar" />
Note that dir must point to the folder holding the 3 jars copied from mmseg4j-1.9.1\dist.
8. Restart Tomcat and open http://localhost:8080/solr/#/collection1/analysis . Under "Analyse Fieldname / FieldType:" select MaxWord, then type "solr是一个伟大的开源的搜索引擎" into the text box under "Field Value (Index)" to see how the text is segmented.
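The same check can probably be done without the admin page. The sketch below assumes that the stock field-analysis request handler is still registered at /analysis/field in solrconfig.xml, which this post does not confirm for this setup; the helper names analysis_url and fetch_analysis are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def analysis_url(base, core, field_type, text):
    """Build a field-analysis request URL for one field type and one input text."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    })
    return f"{base}/{core}/analysis/field?{params}"


def fetch_analysis(url):
    """Send the request and return the decoded JSON (needs a running Solr)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Host, port and core name are the ones used in step 8.
    print(analysis_url("http://localhost:8080/solr", "collection1",
                       "textMaxWord", "solr是一个伟大的开源的搜索引擎"))
```

Passing the printed URL to fetch_analysis once Tomcat is up should return the token list for the textMaxWord type; swapping in textSimple or textComplex makes it easy to compare the three modes.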
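The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex additionally filters the candidates through four rules. The official claim is a correct word identification rate of 98.41%. mmseg4j implements both methods.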
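To make "forward maximum matching" concrete, here is a minimal sketch of the idea behind the Simple method. It is not the mmseg4j implementation: the function name and the toy dictionary are made up for this post, and the real tokenizer handles letters, digits and units in ways this sketch ignores.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by repeatedly taking the longest dictionary word at the cursor.

    Characters that start no dictionary word become single-character tokens.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Toy dictionary, made up for this example (not the mmseg4j word list).
    words = {"研究", "研究生", "生命", "起源"}
    print(forward_max_match("研究生命起源", words))  # ['研究生', '命', '起源']
```

The greedy match grabs 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Ambiguities of this kind are what the extra rules in Complex are meant to filter out.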
