Full-Text Search Engines and Tools: Lucene and Solr

潇洒紫焰 · Posted 2017-12-19 09:22:22

　　import org.slf4j.LoggerFactory
　　import com.typesafe.config.{ConfigFactory, _}
　　import org.apache.lucene
　　import lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, _}
　　import lucene.document.{Field, FieldType, _}
　　import lucene.analysis.{CharArraySet, _}
　　import lucene.search.{IndexSearcher, _}
　　import lucene.store.{RAMDirectory, _}
　　import lucene.analysis.cn.smart.{SmartChineseAnalyzer, _}
　　import org.apache.lucene.index.IndexWriterConfig.OpenMode
　　import org.apache.lucene.queryparser.classic.QueryParser
　　import scala.collection.convert.ImplicitConversions._
　　import scala.io.Source
　　import scala.util.{Failure, Success, Try}
　　val log = LoggerFactory.getLogger(this.getClass)
　　val conf = ConfigFactory.load("app")  // load the configuration file; for Java programmers this is roughly like reading an app.properties file

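　　log.info("creating lucene index for wikipedia")
　　// Directory where the Lucene index files are saved, e.g. /path/to/index/
　　val indexDir = conf.getString("lucene.indexDir")
　　log.debug(s"lucene index writing directory: $indexDir")
　　// Create a Directory object: either one that writes to disk (FSDirectory) or one that works purely in memory (RAMDirectory). The former needs the index directory as a parameter.
　　// val idxDir = FSDirectory.open(java.nio.file.Paths.get(indexDir))
　　// This is a small example, so just work in memory
　　val idxDir = new RAMDirectory()
　　val stopWordsFiles = conf.getString("lucene.stopWordsFiles")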
　　log.debug(s"lucene.stopWordsFiles: $stopWordsFiles")
　　// Stop-word list, used to filter useless stop words out of the tokenizer output
　　val stopWords = stopWordsFiles.split(",").flatMap(f => Try {
　　  if (f.trim.nonEmpty)
　　    Source.fromFile(f.trim).getLines()
　　  else
　　    Iterator.empty
　　} match {
　　  case Success(x) => x
　　  case Failure(e) =>
　　    // A missing or unreadable file is logged and skipped rather than aborting
　　    log.warn(s"error in loading stop words file: $f", e)
　　    Iterator.empty
　　}).toList

　　log.debug(s"stop words: ${stopWords.size}")
　　val smartcn = new SmartChineseAnalyzer(new CharArraySet(stopWords, true))
　　val iwConf = new IndexWriterConfig(smartcn)
　　// iwConf.setOpenMode(OpenMode.CREATE) // when working in RAM, the write mode (create from scratch, append, etc.) makes no difference
　　val indexWriter = new IndexWriter(idxDir, iwConf)
　　// $lucene.wikiIdTitle is the path of the document collection file; in this example its contents look like this
　　/*
　　1832186 义胆雄心
　　5376724 龙山洞
　　5420049 地下情人
　　5431949 里弗顿
　　5455483 长隆
　　5463308 阿尔伯特桥
　　5470979 冈田
　　5511092 肖迪奇站
　　5544906 莫农加希拉_(消歧义)
　　5553846 蓬莱洞
　　5553849 南山洞
　　5566592 开水
　　5566629 氧化锑
　　*/
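　　val pgIdTtls = Source.fromFile(conf.getString("lucene.wikiIdTitle"))
　　  .getLines()
　　  .filter(ln => ln.nonEmpty)
　　  .map(ln => {
　　    // Assumes the id and the title are separated by whitespace, as in the sample above
　　    val idTtl = ln.split("\\s+", 2)
　　    (idTtl(0), idTtl(1))
　　  })
　　// pgIdTtls is an iterator of pairs (2-tuples). Java programmers need not know what a tuple is; think of it as two columns, where the first is a unique id and the second is text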
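　　pgIdTtls.foreach(e => {
　　  import lucene.document._
　　  // Create a Document
　　  val ldoc = new lucene.document.Document()
　　  // The first column is the unique document id. It does not need to be indexed, but it must be stored so that the id field is visible in the search results. If it is not stored (.setStored(false)), the field will not show up in the results even though it was added to the Document
　　  val pageIdFieldType = new FieldType()
　　  pageIdFieldType.setStored(true)  // store it, because we want this value back in the results
　　  pageIdFieldType.setIndexOptions(lucene.index.IndexOptions.NONE)  // do not index the id field
　　  // For performance, pageIdFieldType could be moved outside the loop; it is kept here so that the use of FieldType reads more naturally
　　  // Create a field named pageId for the document (the name is arbitrary, but you must remember it when extracting values from the results); its value is the first column (the unique document id)
　　  val piFld = new lucene.document.Field("pageId", e._1, pageIdFieldType)
　　  ldoc.add(piFld)
　　  // Add the document text as the title field and store it (Field.Store.YES)
　　  ldoc.add(new lucene.document.TextField("title", e._2, Field.Store.YES))
　　  indexWriter.addDocument(ldoc)  // write to the index
　　})
　　//indexWriter.close()  // when writing to disk (FSDirectory.open), remember to close the writer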
　　val searcher = new IndexSearcher(DirectoryReader.open(indexWriter))
　　// To read an index stored in a directory on disk, use this instead
　　// val searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(java.nio.file.Paths.get("/path/to/index/"))))
　　import lucene.queryparser.classic._
　　// Search on the title field
　　val queryParser = new QueryParser("title", smartcn)
　　// Search for "南山洞"
　　val query = queryParser.parse("南山洞")
　　val hitPageIdTitles = searcher
　　  .search(query, 30)  // return at most 30 results; any more are not wanted (like SQL's LIMIT 30)
　　  .scoreDocs.map(sd => searcher.doc(sd.doc))
　　  .map(d => (d.get("pageId"), d.get("title")))
　　// The Scala statement above may be a little hard for Java programmers to follow; translated to Java it looks like this
　　/*
　　ScoreDoc[] scoreDocs = searcher.search(query, 30).scoreDocs;
　　for (ScoreDoc scoreDoc : scoreDocs) {
　　  // Get the matching Document
　　  org.apache.lucene.document.Document d = searcher.doc(scoreDoc.doc);
　　  // Read the document id and the text from the Document
　　  System.out.println(d.get("pageId") + ", " + d.get("title"));
　　}
　　*/
　　// Two results are found: (5553849,南山洞), (5376724,龙山洞)
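
The indexing code reads a file of `id title` lines and turns each line into a pair before building Lucene documents. As a minimal, Lucene-free sketch of that parsing step, the following shows one way to do it so that blank or malformed lines are skipped instead of throwing. The names `IdTitleParsing` and `parseIdTitle` are hypothetical, and the whitespace separator is an assumption based on the sample data; the original post does not confirm the actual delimiter.

```scala
object IdTitleParsing {
  // Hypothetical helper (not part of the original post): split one line into (id, title).
  // Assumption: the id and the title are separated by whitespace, and only the first
  // run of whitespace is a separator, so a title that contains spaces stays intact.
  def parseIdTitle(line: String): Option[(String, String)] =
    line.trim.split("\\s+", 2) match {
      case Array(id, title) if id.nonEmpty && title.nonEmpty => Some((id, title))
      case _ => None // blank or malformed line: skip it instead of throwing
    }

  def main(args: Array[String]): Unit = {
    val lines = List("5553849 南山洞", "", "5544906 莫农加希拉_(消歧义)", "malformed")
    // Keep only the lines that parsed successfully
    val pairs = lines.flatMap(l => parseIdTitle(l))
    pairs.foreach { case (id, title) => println(s"$id -> $title") }
  }
}
```

Returning an `Option` keeps a bad line from aborting the whole indexing run with an `ArrayIndexOutOfBoundsException`; the caller can decide whether to drop such lines silently or log them.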
