Notes on Synonym Expansion and Token Deduplication with a Custom QParser in Apache Solr

xiaoyu28 · posted on 2015-7-16 11:25:56

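It has been a while since my last blog post. I have recently been building a system on Solr and have plenty of notes that are still unwritten. This post records how to make a custom QParser work together with SynonymFilter and RemoveDuplicatesTokenFilter, so that tokens are both synonym-expanded and deduplicated at query time.
At first I followed the Solr wiki and configured the following filters in schema.xml: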

[schema.xml filter configuration, 7 lines: not preserved in this copy]

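In practice, however, RemoveDuplicatesTokenFilterFactory did not filter out the duplicate tokens. For example, the query “摩托罗拉 motorola 里程碑2代” became “摩托罗拉 摩托 motorola moto motorola 摩托罗拉 摩托 moto 里程碑 2代” after synonym expansion (here and below, synonym expansion means expanding a brand name into its Chinese and English forms), so 【摩托罗拉】, 【摩托】, 【motorola】 and 【moto】 each appear twice. Because I use a custom QParser based on DisMaxQParser, the extra tokens change how the min-should-match parameter is evaluated and lower matching precision.
To find out why RemoveDuplicatesTokenFilter had no effect, I opened its source: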

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            final char term[] = termAttribute.buffer();
            final int length = termAttribute.length();
            final int posIncrement = posIncAttribute.getPositionIncrement();
            if (posIncrement > 0) {
                previous.clear();
            }
            boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));
            // clone the term, and add to the set of seen terms.
            char saved[] = new char[length];
            System.arraycopy(term, 0, saved, 0, length);
            previous.add(saved);
            if (!duplicate) {
                return true;
            }
        }
        return false;
    }

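As the source shows, RemoveDuplicatesTokenFilter only checks for duplicates among tokens whose positionIncrement is 0, and it clears its set of seen terms whenever positionIncrement is greater than 0. A synonym produced by SynonymFilter does have positionIncrement 0, but it never duplicates the original token at that position. The duplicates that actually occur sit at a later position, after a token with positionIncrement greater than 0 has already cleared the set, so they are never removed.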
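To make that behaviour concrete, here is a minimal sketch that replays the same rule on a plain list instead of a real Lucene TokenStream. The `Tok` class, the `RemoveDuplicatesDemo` name, and the ASCII tokens (`motorola` and `moto` standing in for the brand synonyms, `milestone` for the model name) are assumptions made for illustration; only the clear-on-positive-increment rule comes from the source quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RemoveDuplicatesDemo {
    /** Hypothetical stand-in for a Lucene token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Same rule as the quoted filter: drop a token only when posInc == 0 and the term was already seen at this position. */
    static List<String> filter(List<Tok> in) {
        Set<String> previous = new HashSet<String>();
        List<String> out = new ArrayList<String>();
        for (Tok t : in) {
            if (t.posInc > 0) {
                previous.clear();            // a new position forgets everything seen so far
            }
            boolean duplicate = t.posInc == 0 && previous.contains(t.term);
            previous.add(t.term);
            if (!duplicate) {
                out.add(t.term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "motorola moto milestone" after an assumed two-way synonym expansion.
        List<Tok> expanded = Arrays.asList(
                new Tok("motorola", 1), new Tok("moto", 0),
                new Tok("moto", 1), new Tok("motorola", 0),
                new Tok("milestone", 1));
        // Duplicates at different positions survive:
        System.out.println(filter(expanded)); // [motorola, moto, moto, motorola, milestone]
        // Only a duplicate at the same position is removed:
        System.out.println(filter(Arrays.asList(new Tok("a", 1), new Tok("a", 0)))); // [a]
    }
}
```

The first call returns all five tokens unchanged, which is the situation described above: the set is emptied at every new position, so cross-position repeats are invisible to the filter.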
To deal with this, I decided to start from the custom QParser and solve the problem along these lines:

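Find a way to get Solr's TokenizerChain inside the QParser and take the SynonymFilterFactory from it
Get the tokenizing Analyzer inside the QParser and build a SynonymFilter instance on top of the Analyzer's TokenStream
Iterate over the tokens through the SynonymFilter (by calling incrementToken) and branch on the positionIncrement of each expanded token:
- If positionIncrement > 0, check whether the term has already appeared; if it has not, let it through and put it in a Set for later duplicate checks
- If positionIncrement == 0, only put it in the Set for later duplicate checks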
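The branching in the last step can be sketched without any Solr dependency. The sketch below is only an illustration under assumptions: `Tok`, `QueryTokenNormalizer` and `normalize` are hypothetical names, and the ASCII tokens stand in for the brand synonyms from the earlier example; in the real QParser the tokens would come from the SynonymFilter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QueryTokenNormalizer {
    /** Hypothetical stand-in for a Lucene token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Keeps a token only if it starts a new position and no earlier token or synonym had the same term. */
    static Set<String> normalize(List<Tok> expanded) {
        Set<String> seen = new HashSet<String>();        // every term seen so far, synonyms included
        Set<String> kept = new LinkedHashSet<String>();  // surviving terms, in query order
        for (Tok t : expanded) {
            String term = t.term.toLowerCase(Locale.ROOT);
            if (t.posInc > 0) {
                if (seen.add(term)) {   // add() returns false when the term was already seen
                    kept.add(term);
                }
            } else {
                seen.add(term);         // injected synonym: remember it, do not emit it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "motorola moto milestone" after an assumed two-way synonym expansion.
        List<Tok> expanded = Arrays.asList(
                new Tok("motorola", 1), new Tok("moto", 0),
                new Tok("moto", 1), new Tok("motorola", 0),
                new Tok("milestone", 1));
        StringBuilder sb = new StringBuilder();
        for (String tok : normalize(expanded)) {
            sb.append(tok).append(' ');
        }
        System.out.println(sb.toString().trim()); // motorola milestone
    }
}
```

Here `moto` is dropped because it was already recorded as a synonym of `motorola`, so each brand reaches the later analysis chain once.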

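This logic does more than filter out duplicate tokens: in effect it also "normalizes" the tokens. In Solr's query lifecycle the custom QParser runs before the TokenizerChain configured in schema.xml, so the normalized query goes through synonym expansion once more. After that expansion there are no duplicate tokens, and search precision is no longer affected.
Part of the code: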

    // Lucene/Solr 3.x-era API (TermAttribute, reusableTokenStream).
    Analyzer analyzer = req.getSchema().getQueryAnalyzer();
    final TokenizerChain tokenizerChain = (TokenizerChain) req.getSchema().getField("title").getType().getQueryAnalyzer();
    // Find the SynonymFilterFactory configured in the "title" field's query analyzer.
    SynonymFilterFactory sff = null;
    for (TokenFilterFactory tf : tokenizerChain.getTokenFilterFactories()) {
        if (tf instanceof SynonymFilterFactory) {
            sff = (SynonymFilterFactory) tf;
        }
    }
    // Without an analyzer or a synonym filter there is nothing to do
    // (sff.create below would otherwise throw a NullPointerException).
    if (null == analyzer || null == sff) {
        return;
    }
    …………
    StringReader reader = new StringReader(qstr);
    StringBuilder buffer = new StringBuilder(128);
    Set<String> tokenSet = new LinkedHashSet<String>();  // tokens to keep, in query order
    …………
    TokenStream tokens = analyzer.reusableTokenStream("title", reader);
    SynonymFilter sf = sff.create(tokens);
    sf.reset();
    TermAttribute termAtt = sf.getAttribute(TermAttribute.class);
    PositionIncrementAttribute positionIncrementAttribute = sf.getAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsetAttribute = sf.getAttribute(OffsetAttribute.class);
    Set<String> duplicatedTokenSet = new HashSet<String>();  // every term seen so far, synonyms included
    while (sf.incrementToken()) {
        final String token = (new String(termAtt.termBuffer(), 0, termAtt.termLength())).toLowerCase();
        final int posIncr = positionIncrementAttribute.getPositionIncrement();
        if (posIncr > 0) {
            // Token at a new position: keep it only if it has not been seen before.
            if (!duplicatedTokenSet.contains(token)) {
                duplicatedTokenSet.add(token);
                tokenSet.add(token);
            }
        } else {
            // Injected synonym: remember it, but do not emit it.
            duplicatedTokenSet.add(token);
        }
    }
    …………
    // Rebuild the query string from the surviving tokens.
    for (String tok : tokenSet) {
        buffer.append(tok).append(" ");
    }
    if (buffer.length() > 0) {
        qstr = buffer.toString();
    }

Postscript:
Solr's DisjunctionMaxQuery is an interesting piece of code; I plan to find time to read it properly and write up a summary.
Excerpted from: http://www.jnan.org/archives/2011/10/hacking-solr-synonymfilter-and-removeduplicatestokenfilter-with-custom-qparser.html#more-528
　　
　　
