Search engines: building an index and searching with pyes, the Python client for elasticsearch

lenf · Posted on 2017-5-8 08:59:16

Host environment: Ubuntu 13.04
Python version: 2.7.4
When reposting, please credit: http://blog.yanming8.cn/archives/118

Official site: http://www.elasticsearch.com/
Chinese site: http://es-cn.medcl.net/
The following introduction is quoted from the Chinese site:

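So, suppose you have built a web site or an application. Sooner or later you will need to add search (it really is that essential), and getting search to work well is hard. We want search to be fast, and we want installation to be easy (ideally painless). The schema should be completely free (schema free), documents should be indexable as JSON over HTTP, the server must always be available (HA, and this is not negotiable), it should scale from one machine to thousands, search must be real-time, and it must be simple to use and support multi-tenancy. We need a complete solution, and one that is built for the cloud.
"Make search simpler" is our manifesto, "and make it cool, like a bonsai."
elasticsearch aims to solve all of the problems above and more. It is an open-source (Apache 2 license), distributed, RESTful search engine built on top of Apache Lucene.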

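1. Installing the distributed server
First go to http://www.elasticsearch.org/download/ and pick a suitable version to install. Here I downloaded the DEB package for Ubuntu and installed it directly with the dpkg command. Once it is installed, the service can be started with:
sudo service elasticsearch start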
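To confirm that the service really is up, one option is to query its HTTP port from Python. The sketch below assumes the address 127.0.0.1:9200, the same one the scripts later in this post connect to. The helper names `es_url` and `es_info` are invented for this illustration and are not part of pyes. It uses only the standard library and is written to run on both Python 2.7 and Python 3.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def es_url(host="127.0.0.1", port=9200, path="/"):
    """Build an HTTP URL for the elasticsearch REST endpoint."""
    return "http://%s:%d/%s" % (host, port, path.lstrip("/"))


def es_info(host="127.0.0.1", port=9200, timeout=3.5):
    """Return the parsed JSON reply of the root endpoint, or None if unreachable."""
    try:
        response = urlopen(es_url(host, port), timeout=timeout)
        try:
            return json.loads(response.read().decode("utf-8"))
        finally:
            response.close()
    except (IOError, ValueError):
        # Connection refused, timeout, HTTP error, or a reply that is not JSON.
        return None
```

Calling `es_info()` right after starting the service should return a dictionary describing the node; `None` suggests the service is not listening yet.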

2. Installing the pyes client
Install the Python client for elasticsearch with the command:

pip install pyes

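3. Installing the Chinese word-segmentation component
Download the Chinese analysis component directly from https://github.com/medcl/elasticsearch-rtf/blob/master/elasticsearch/plugins/analysis-ik/elasticsearch-analysis-ik-1.2.2.jar
and move it into the elasticsearch installation directory, under /usr/share/elasticsearch/analysis-ik/.
Then edit the configuration file /etc/elasticsearch/elasticsearch.yml
and set the plugin path:
path.plugins: /usr/share/elasticsearch/plugins
Also add the analyzer configuration: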

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
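The nesting matters here: each level is one segment of a dotted setting name, in the same style as the `path.plugins` key above. As an illustration only, the sketch below writes the same settings as a Python dict and flattens them into dotted names, which can make an indentation mistake easier to spot. `flatten` is a helper written for this example, not part of pyes or elasticsearch.

```python
def flatten(settings, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in settings.items():
        dotted = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


# The analyzer settings from the YAML above, as a nested dict.
ik_settings = {
    "index": {"analysis": {"analyzer": {"ik": {
        "alias": ["ik_analyzer"],
        "type": "org.elasticsearch.index.analysis.IkAnalyzerProvider",
    }}}}
}

for name, value in sorted(flatten(ik_settings).items()):
    print("%s: %s" % (name, value))
```

Both resulting names start with `index.analysis.analyzer.ik.`, which is the prefix the `ik` analyzer name in the mapping later refers to.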

Finally, download the dictionary used by the IK analyzer:

cd /etc/elasticsearch
wget http://github.com/downloads/medcl/elasticsearch-analysis-ik/ik.zip --no-check-certificate
unzip ik.zip
rm ik.zip
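For readers who would rather script this step, the unpack-and-delete part can be done with Python's standard `zipfile` module. This is only a sketch: `extract_dictionary` is a name invented here, and the paths in the usage comment are the ones from the shell commands above.

```python
import os
import zipfile


def extract_dictionary(zip_path, dest_dir, remove_archive=True):
    """Unpack zip_path into dest_dir and return the names of the archive entries."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    if remove_archive:
        os.remove(zip_path)  # mirrors `rm ik.zip`
    return names


# Example (paths taken from the commands above):
# extract_dictionary("/etc/elasticsearch/ik.zip", "/etc/elasticsearch")
```

Returning the entry names makes it easy to log what was actually unpacked before restarting the service.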

Then restart the elasticsearch service.

4. Building the index

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from pyes import *

INDEX_NAME = 'txtfiles'


class IndexFiles(object):
    def __init__(self, root):
        conn = ES('127.0.0.1:9200', timeout=3.5)  # connect to ES
        try:
            # Drop any existing index so that each run starts clean.
            conn.delete_index(INDEX_NAME)
        except Exception:
            pass  # most likely the index did not exist yet
        conn.create_index(INDEX_NAME)  # create a new index

        # Define how each field of the index is stored and analyzed.
        mapping = {u'content': {'boost': 1.0,
                                'index': 'analyzed',
                                'store': 'yes',
                                'type': u'string',
                                "indexAnalyzer": "ik",
                                "searchAnalyzer": "ik",
                                "term_vector": "with_positions_offsets"},
                   u'name': {'boost': 1.0,
                             'index': 'analyzed',
                             'store': 'yes',
                             'type': u'string',
                             "indexAnalyzer": "ik",
                             "searchAnalyzer": "ik",
                             "term_vector": "with_positions_offsets"},
                   u'dirpath': {'boost': 1.0,
                                'index': 'analyzed',
                                'store': 'yes',
                                'type': u'string',
                                "indexAnalyzer": "ik",
                                "searchAnalyzer": "ik",
                                "term_vector": "with_positions_offsets"},
                   }

        # Register the mapping under the document type "test-type".
        conn.put_mapping("test-type", {'properties': mapping}, [INDEX_NAME])

        self.addIndex(conn, root)

        conn.default_indices = [INDEX_NAME]  # set the default index
        conn.refresh()  # refresh so the newly inserted documents become visible

    def addIndex(self, conn, root):
        print root
        for root, dirnames, filenames in os.walk(root):
            for filename in filenames:
                if not filename.endswith('.txt'):
                    continue
                print "Indexing file ", filename
                try:
                    path = os.path.join(root, filename)
                    with open(path) as f:
                        contents = unicode(f.read(), 'utf-8')
                    if len(contents) > 0:
                        conn.index({'name': filename, 'dirpath': root, 'content': contents},
                                   INDEX_NAME, 'test-type')
                    else:
                        print 'no contents in file %s' % path
                except Exception, e:
                    print e


if __name__ == '__main__':
    IndexFiles('./txtfiles')
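The three field definitions in the mapping are identical, and the directory walk does not depend on elasticsearch at all, so both can be factored into small functions that are easy to test without a running server. The sketch below is one possible arrangement, not the original author's code: `analyzed_field`, `build_mapping`, and `iter_txt_documents` are names invented here, and it uses `io.open` so that it runs on both Python 2.7 and Python 3.

```python
import io
import os


def analyzed_field(analyzer="ik"):
    """Return one field definition with the same options as the script above."""
    return {
        "boost": 1.0,
        "index": "analyzed",
        "store": "yes",
        "type": "string",
        "indexAnalyzer": analyzer,
        "searchAnalyzer": analyzer,
        "term_vector": "with_positions_offsets",
    }


def build_mapping(field_names=("content", "name", "dirpath")):
    """Build the 'properties' dict: every listed field gets its own copy of the definition."""
    return dict((name, analyzed_field()) for name in field_names)


def iter_txt_documents(root):
    """Yield one {'name', 'dirpath', 'content'} dict per non-empty UTF-8 .txt file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(dirpath, filename)
            with io.open(path, encoding="utf-8") as handle:
                content = handle.read()
            if content:
                yield {"name": filename, "dirpath": dirpath, "content": content}
```

With these in place, the indexing loop reduces to calling `conn.index(doc, INDEX_NAME, 'test-type')` for each `doc` yielded by `iter_txt_documents(root)`. Unlike the script above, this generator lets decoding errors propagate instead of printing them, so the caller decides how to handle a badly encoded file.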

5. Searching and highlighting the results

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from pyes import *

conn = ES('127.0.0.1:9200', timeout=3.5)  # connect to ES
# Search the "content" field for the phrase "世界末日" ("end of the world").
sq = StringQuery(u'世界末日', 'content')
# Wrap each match in <b>...</b> and return fragments of about 20 characters.
h = HighLighter(['<b>'], ['</b>'], fragment_size=20)

s = Search(sq, highlight=h)
s.add_highlight("content")
results = conn.search(s, indices='txtfiles', doc_types='test-type')

hits = []
for r in results:
    # Replace the full text with the first highlighted fragment, if there is one.
    if "content" in r._meta.highlight:
        r['content'] = r._meta.highlight[u"content"][0]
    hits.append(r)
    print r['content']
print len(hits)
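pyes turns the objects above into a JSON request body for elasticsearch's search endpoint. The sketch below shows roughly what such a body might look like and how the highlight-merging loop behaves on a raw response. It does not use pyes: `build_search_body` and `apply_highlights` are names invented for this example, the exact body pyes sends may differ, and the sample response is made-up data in the general shape elasticsearch returns.

```python
# -*- coding: utf-8 -*-
import json


def build_search_body(text, field, pre_tag="<b>", post_tag="</b>", fragment_size=20):
    """Build a query_string search on one field with highlighting enabled for it."""
    return {
        "query": {"query_string": {"query": text, "default_field": field}},
        "highlight": {
            "pre_tags": [pre_tag],
            "post_tags": [post_tag],
            "fragment_size": fragment_size,
            "fields": {field: {}},
        },
    }


def apply_highlights(hits, field):
    """Copy each hit's _source, replacing `field` with its first highlighted fragment if any."""
    docs = []
    for hit in hits:
        doc = dict(hit.get("_source", {}))
        fragments = hit.get("highlight", {}).get(field)
        if fragments:
            doc[field] = fragments[0]
        docs.append(doc)
    return docs


# Made-up response fragment, for illustration only.
sample_hits = [
    {"_source": {"name": "a.txt", "content": "long text ..."},
     "highlight": {"content": ["the <b>end</b> of the world"]}},
    {"_source": {"name": "b.txt", "content": "no highlight here"}},
]

print(json.dumps(build_search_body(u"世界末日", "content"), indent=2, sort_keys=True))
for doc in apply_highlights(sample_hits, "content"):
    print(doc["content"])
```

Working on a copy of `_source` keeps the original hit untouched, so the full text is still available if a fragment turns out to be too short to display.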
