public Query like(int docNum) throws IOException {
    if (fieldNames == null) {
        // gather the list of indexed fields from the Lucene index
        Collection<String> fields = ir
                .getFieldNames(IndexReader.FieldOption.INDEXED);
        fieldNames = fields.toArray(new String[fields.size()]);
    }
    return createQuery(retrieveTerms(docNum));
}
Before building this "magic" query, we first need to collect the raw candidate terms (retrieveTerms).
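For orientation, here is a minimal usage sketch of the like() entry point against the Lucene 2.9-era API. The index path, field names, and threshold values below are illustrative assumptions, not from the original article.

// Illustrative driver for MoreLikeThis.like(int docNum); the classes come from
// org.apache.lucene.index, org.apache.lucene.search, org.apache.lucene.store,
// and org.apache.lucene.search.similar. All concrete values are assumed.
IndexReader ir = IndexReader.open(FSDirectory.open(new File("/path/to/index")), true);
IndexSearcher searcher = new IndexSearcher(ir);
MoreLikeThis mlt = new MoreLikeThis(ir);
mlt.setFieldNames(new String[] {"ti", "ab"}); // fields to mine terms from
mlt.setMinTermFreq(2); // drop terms occurring fewer than 2 times in the source doc
mlt.setMinDocFreq(5);  // drop terms occurring in fewer than 5 documents
Query query = mlt.like(42);                   // 42 = docNum of the source document
TopDocs hits = searcher.search(query, 10);    // the 10 most similar documents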
public PriorityQueue retrieveTerms(int docNum) throws IOException {
    Map<String, Int> termFreqMap = new HashMap<String, Int>();
    for (int i = 0; i < fieldNames.length; i++) {
        String fieldName = fieldNames[i];
        TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName);
        if (vector == null) {
            // field does not store term vector info: re-analyze the stored values
            Document d = ir.document(docNum);
            String[] text = d.getValues(fieldName);
            if (text != null) {
                for (int j = 0; j < text.length; j++) {
                    addTermFrequencies(new StringReader(text[j]), termFreqMap,
                            fieldName);
                }
            }
        } else {
            addTermFrequencies(termFreqMap, vector);
        }
    }
    return createQueue(termFreqMap);
}
The method walks over every field: it first fetches the field's TermFreqVector and then feeds it into addTermFrequencies. This is the TF-computation step; the result is accumulated in a map whose key is the term and whose value is the term's occurrence count (termFrequencies). (When a field stores no term vector, the stored values are re-analyzed through the StringReader branch instead.)
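addTermFrequencies itself is not reproduced in the excerpt. A minimal sketch of its term-vector branch, assuming the small internal Int counter class that MoreLikeThis uses (a mutable public int field x), could look like this:

// Sketch only: accumulate per-term counts from a stored term vector.
private void addTermFrequencies(Map<String, Int> termFreqMap, TermFreqVector vector) {
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies(); // per-term counts stored in the index
    for (int j = 0; j < terms.length; j++) {
        String term = terms[j];
        if (isNoiseWord(term)) {
            continue; // drop insignificant terms (see isNoiseWord below)
        }
        Int cnt = termFreqMap.get(term);
        if (cnt == null) {
            cnt = new Int();   // assumed helper: a mutable counter
            cnt.x = freqs[j];
            termFreqMap.put(term, cnt);
        } else {
            cnt.x += freqs[j]; // same term seen again (e.g. a multi-valued field)
        }
    }
}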
During this step we also need to de-noise, i.e. discard insignificant terms. The check works as follows:
private boolean isNoiseWord(String term) {
    int len = term.length();
    if (minWordLen > 0 && len < minWordLen) {
        return true;
    }
    if (maxWordLen > 0 && len > maxWordLen) {
        return true;
    }
    if (stopWords != null && stopWords.contains(term)) {
        return true;
    }
    return false;
}
There are two main criteria:
1. the term's length must fall within the [minWordLen, maxWordLen] range;
2. the term must not appear in stopWords.
Both are configurable, as sketched below.
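A short configuration sketch (the concrete limits and the stop-word set are illustrative; setStopWords accepts any Set):

mlt.setMinWordLen(2);  // terms shorter than 2 characters count as noise
mlt.setMaxWordLen(30); // terms longer than 30 characters count as noise
mlt.setStopWords(StopAnalyzer.ENGLISH_STOP_WORDS_SET); // stop words count as noise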
Back in retrieveTerms: the method returns a PriorityQueue, so the tf map we just built still needs further processing (this is the important part). The Javadoc describes it as: "Find words for a more-like-this query former" and "Create a PriorityQueue from a word->tf map."
private PriorityQueue createQueue(Map<String, Int> words)
        throws IOException {
    // have collected all words in doc and their freqs
    int numDocs = ir.numDocs();
    FreqQ res = new FreqQ(words.size()); // will order words by score
    Iterator<String> it = words.keySet().iterator();
    while (it.hasNext()) { // for every word
        String word = it.next();
        int tf = words.get(word).x; // term freq in the source doc
        if (minTermFreq > 0 && tf < minTermFreq) {
            continue; // filter out words that don't occur enough times in the source
        }
        // go through all the fields and find the largest document frequency
        String topField = fieldNames[0];
        int docFreq = 0;
        for (int i = 0; i < fieldNames.length; i++) {
            int freq = ir.docFreq(new Term(fieldNames[i], word));
            topField = (freq > docFreq) ? fieldNames[i] : topField;
            docFreq = (freq > docFreq) ? freq : docFreq;
        }
        if (minDocFreq > 0 && docFreq < minDocFreq) {
            continue; // filter out words that don't occur in enough docs
        }
        if (docFreq > maxDocFreq) {
            continue; // filter out words that occur in too many docs
        }
        if (docFreq == 0) {
            continue; // index update problem?
        }
        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;
        // only really need the first 3 entries, the others are for troubleshooting
        res.insertWithOverflow(new Object[] {
                word,                     // the word
                topField,                 // the top field
                Float.valueOf(score),     // overall score
                Float.valueOf(idf),       // idf
                Integer.valueOf(docFreq), // freq in all docs
                Integer.valueOf(tf)});    // freq in the source doc
    }
    return res;
}
This method iterates over all terms, reading each term's tf together with its largest df across all the configured fields (for example, mlt.fl=ti,ab,mcn). From that df and the number of documents currently in the index it computes the idf, and then scores the term as score = tf * idf.
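With the stock DefaultSimilarity, idf(docFreq, numDocs) = 1 + ln(numDocs / (docFreq + 1)). A quick hand computation with illustrative numbers: for numDocs = 1,000,000 and docFreq = 100, idf = 1 + ln(1000000 / 101) ≈ 10.20, so a term with tf = 3 scores about 3 × 10.20 ≈ 30.6.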
Once the PriorityQueue is built, we can turn it into the "magic" query mentioned at the start. The Javadoc: "Create the More like query from a PriorityQueue."
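createQuery is not reproduced in the article. In this Lucene generation it essentially pops terms off the queue (highest score first), wraps each in a TermQuery, and, when boosting is enabled, boosts every clause by its score relative to the best score. A simplified sketch, with the Object[] slots laid out as in createQueue above:

// Sketch of createQuery: drain the queue into a BooleanQuery of TermQuerys.
private Query createQuery(PriorityQueue q) {
    BooleanQuery query = new BooleanQuery();
    Object cur;
    int qterms = 0;
    float bestScore = 0;
    while ((cur = q.pop()) != null) {
        Object[] ar = (Object[]) cur;
        TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
        if (boost) {
            if (qterms == 0) {
                bestScore = ((Float) ar[2]).floatValue(); // first pop = top score
            }
            float myScore = ((Float) ar[2]).floatValue();
            tq.setBoost(boostFactor * myScore / bestScore);
        }
        try {
            query.add(tq, BooleanClause.Occur.SHOULD);
        } catch (BooleanQuery.TooManyClauses ignore) {
            break; // stop once the clause limit is reached
        }
        qterms++;
        if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
            break; // honor the configured cap on query terms
        }
    }
    return query;
}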
(2) Given this query, Lucene's scoring algorithm is used to find the similar documents.
Lucene combines the Boolean model (BM) and the Vector Space Model (VSM) from information retrieval to implement its own scoring mechanism.
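The practical scoring function documented at the link below combines the two models roughly as:

score(q,d) = coord(q,d) * queryNorm(q) * Σ_(t in q) [ tf(t in d) * idf(t)² * t.getBoost() * norm(t,d) ]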
For details, see:
http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/api/core/org/apache/lucene/search/Similarity.html