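A Summary of the Underlying Principles of Elasticsearch Search

coverl · Posted on 2019-1-29 08:26:49

Contents:
1. Analyzing the _search response
2. multi-index and multi-type
3. Paginated queries and deep paging
4. Query DSL and query string
5. mapping
6. Inverted index and forward index (doc values)
7. Analyzers
8. exact value and full text
9. Building the index
10. search API
11. Document relevance scoring: the TF&IDF algorithm
12. Search-related parameters
13. query phase
14. fetch phase
1. Analyzing the _search response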
GET /_search
{
  "took": 6,
  "timed_out": false,
  "_shards": {
"total": 6,
"successful": 6,
"failed": 0
  },
  "hits": {
"total": 10,
"max_score": 1,
"hits": [
   {
      "_index": ".kibana",
      "_type": "config",
      "_id": "5.2.0",
      "_score": 1,
      "_source": {
      "buildNum": 14695
      }
   }
]
  }
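}
  took: how many milliseconds the whole search request took
  hits.total: how many documents matched this search
  hits.max_score: the highest relevance score among all results of this search. Every document has a relevance score for the search; the more relevant it is, the higher its _score and the earlier it ranks
  hits.hits: the full data of the top 10 documents by default, sorted by _score in descending order
  _shards: a shard counts as failed only when its primary and all of its replicas are down, and that does not affect the other shards. By default a search request is sent to every primary shard of an index; since each primary shard may have one or more replica shards, the request can also be served by one of a primary shard's replicas instead.
  timeout: there is no timeout by default. A timeout trades completeness for latency: when you set one, each shard returns whatever it has collected by the deadline instead of the complete result
  Format: timeout=10ms, timeout=1s, timeout=1m
  GET /_search?timeout=10m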
　　

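2. multi-index and multi-type
  2.1 multi-index and multi-type search patterns
  How to search data across several indices and several types in one request
   /_search: search all data in all types of all indices
   /index1/_search: search all types of one given index
   /index1,index2/_search: search two indices at once
   /*1,*2/_search: match multiple indices with wildcards
   /index1/type1/_search: search one given type in one index
   /index1/type1,type2/_search: search several types in one index
   /index1,index2/type1,type2/_search: search several types across several indices
   /_all/type1,type2/_search: _all stands for all indices, so this searches the given types in every index
  2.2 A first illustration of how a simple search works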

3. Paginated queries and deep paging
  3.1 Pagination syntax in es
size, from
GET /_search?size=10
GET /_search?size=10&from=0
GET /_search?size=10&from=20
// A hands-on pagination experiment
GET /test_index/test_type/_search
"hits": {
"total": 9,
"max_score": 1,
// Suppose we split these 9 documents into 3 pages of 3 documents each and see how paginated search behaves
GET /test_index/test_type/_search?from=0&size=3
{
  "took": 2,
  "timed_out": false,
  "_shards": {
"total": 5,
"successful": 5,
"failed": 0
  },
  "hits": {
"total": 9,
"max_score": 1,
"hits": [
   {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "8",
      "_score": 1,
      "_source": {
      "test_field": "test client 2"
      }
   },
   {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "6",
      "_score": 1,
      "_source": {
      "test_field": "tes test"
      }
   },
   {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "4",
      "_score": 1,
      "_source": {
      "test_field": "test4"
      }
   }
]
  }
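}
  3.2 The deep paging problem: why does it occur, and what is the underlying mechanism?

  With from/size paging, every shard has to return its top from + size entries to the coordinate node. When you page too deep, the coordinate node must hold a large amount of data, sort all of it, and only then pick out the requested page. This consumes network bandwidth, memory, and CPU, so deep paging should be avoided whenever possible.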
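To make the cost concrete, here is a minimal Python sketch of the arithmetic. It assumes that each shard returns its own top from + size entries to the coordinate node; the function name deep_paging_cost and the 5-shard index are illustrative assumptions, not part of any Elasticsearch API.

```python
def deep_paging_cost(num_shards: int, from_: int, size: int) -> dict:
    """Estimate how many sorted entries a from/size search moves around."""
    per_shard = from_ + size           # each shard keeps its own top (from + size)
    gathered = num_shards * per_shard  # entries the coordinate node must merge and sort
    return {"per_shard": per_shard, "gathered": gathered, "returned": size}


# First page versus a deep page on a hypothetical 5-shard index.
shallow = deep_paging_cost(num_shards=5, from_=0, size=10)
deep = deep_paging_cost(num_shards=5, from_=10000, size=10)
print(shallow)
print(deep)
```

Both requests return the same 10 documents to the client, but the deep page makes the coordinate node gather and sort roughly a thousand times more entries.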
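4. Query DSL and query string
  4.1 query string search syntax and the _all metadata field
  1. Basic query string syntax
  GET /test_index/test_type/_search?q=test_field:test
  GET /test_index/test_type/_search?q=+test_field:test (must contain)
  GET /test_index/test_type/_search?q=-test_field:test (must not contain)
  The two things to learn are the q=field:search content syntax and the meaning of + and -
  2. How the _all metadata field works and what it is for
  GET /test_index/test_type/_search?q=test
  This searches all fields directly: a document is returned if any of its fields contains the keyword. Does that mean a search is run against every field of the document? No.
  When we index a document that contains several fields, es automatically concatenates the values of all fields into one long string, stores it as the value of the _all field, and indexes it
  Later, if a search does not name a field, it searches the _all field by default, which contains the values of all fields
  4.2 Query DSL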
GET /_search
{
"query": {
      "match_all": {}
}
}
2. Basic syntax of the Query DSL
{
QUERY_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE,...
}
}
{
QUERY_NAME: {
      FIELD_NAME: {
         ARGUMENT: VALUE,
         ARGUMENT: VALUE,...
      }
}
}
Example:
GET /test_index/test_type/_search
{
  "query": {
"match": {
   "test_field": "test"
}
  }
}
  4.2.1 How to combine multiple search conditions
  Search requirement: title must contain elasticsearch, content may or may not contain elasticsearch, and author_id must not be 111
GET /website/article/_search
{
  "query": {
"bool": {
   "must": [
      {
      "match": {
         "title": "elasticsearch"
      }
      }
   ],
   "should": [
      {
      "match": {
         "content": "elasticsearch"
      }
      }
   ],
   "must_not": [
      {
      "match": {
         "author_id": 111
      }
      }
   ]
}
  }
}
GET /test_index/_search
{
"query": {
         "bool": {
            "must": { "match": { "name": "tom" }},
            "should": [
                  { "match":    { "hired": true }},
                  { "bool": {
                     "must":    { "match": { "personality": "good" }},
                     "must_not":  { "match": { "rude": true }}
                  }}
            ],
            "minimum_should_match": 1
         }
}
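}
4.2.2 filter versus query
  filter only filters out the data that matches the conditions; it computes no relevance score and has no effect on relevance
  query computes each document's relevance to the search conditions and sorts by that relevance
  In general, if you are searching and want the best-matching data returned first, use query; if you only want to select a subset of data by some conditions and do not care about its order, use filter
  Put a condition in query only if you want documents that match it better to rank higher; if you do not want a condition to affect document ranking, put it in filter
4.2.3 filter and query performance
  filter does not compute relevance scores or sort by them, and es has a built-in mechanism that automatically caches the most frequently used filters
  query, by contrast, has to compute relevance scores and sort by them, and its results cannot be cached
4.2.4 Common query syntax
1. match all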
GET /_search
{
"query": {
      "match_all": {}
}
}
2. match
GET /_search
{
"query": { "match": { "title": "my elasticsearch article" }}
}
3. multi match
GET /test_index/test_type/_search
{
  "query": {
"multi_match": {
   "query": "test",
   "fields": ["test_field", "test_field1"]
}
  }
}
4. range query
GET /company/employee/_search
{
  "query": {
"range": {
   "age": {
      "gte": 30
   }
}
  }
}
5. term query
GET /test_index/test_type/_search
{
  "query": {
"term": {
   "test_field": "test hello"
}
  }
}
6. terms query
GET /_search
{
"query": { "terms": { "tag": [ "search", "full_text", "nosql" ] }}
}
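7. exists query (still available; it is the companion missing query from 2.x that is no longer provided, and must_not + exists replaces it)
  4.2.5 Combining multiple search conditions

  bool
  must, must_not, should, filter
  Each sub-query computes a relevance score for a document against its own condition; bool then combines all of these into a single score. filter clauses, of course, compute no score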
{
"bool": {
      "must":    { "match": { "title": "how to make millions" }},
      "must_not": { "match": { "tag": "spam" }},
      "should": [
         { "match": { "tag": "starred" }}
      ],
      "filter": {
      "bool": {
            "must": [
               { "range": { "date": { "gte": "2014-01-01" }}},
               { "range": { "price": { "lte": 29.99 }}}
            ],
            "must_not": [
               { "term": { "category": "ebooks" }}
            ]
      }
      }
}
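}
5. mapping
  (1) When you insert data directly into es, es automatically creates the index, the type, and the corresponding mapping
  (2) The mapping automatically defines the data type of each field
  (3) Depending on the data type (for example text versus date), a field may be an exact value or full text
  (4) An exact value is put into the inverted index as one whole term, without being split; full text goes through various processing steps, namely tokenization and normalization (tense conversion, synonym conversion, case conversion), before it is put into the inverted index
  (5) Whether a field is an exact value or full text also determines how a search against it behaves, and that behavior stays consistent with how the inverted index was built: a search on an exact value field matches against the whole value, while the query string for a full text field is tokenized and normalized before it is looked up in the inverted index
  (6) You can let es's dynamic mapping create the mapping automatically, including the data types; or you can create the mapping of the index and type by hand in advance and configure each field yourself, including data type, indexing behavior, analyzer, and so on
  mapping is the metadata of a type in an index. Each type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave
Insert a few documents and let es create the index automatically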
PUT /website/article/1
{
  "post_date": "2017-01-01",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}
PUT /website/article/2
{
  "post_date": "2017-01-02",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}
PUT /website/article/3
{
  "post_date": "2017-01-03",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}
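Try various searches
GET /website/article/_search?q=2017 3 results
GET /website/article/_search?q=2017-01-01       3 results
GET /website/article/_search?q=post_date:2017-01-01 1 result
GET /website/article/_search?q=post_date:2017       1 result
  Why are the results inconsistent? Because when es created the mapping automatically, it gave different fields different data types, and different data types are tokenized and searched differently. That is why the _all field and the post_date field behave completely differently.
  Explanation below
GET /_search?q=2017
This searches the _all field: all fields of a document are concatenated into one big string, which is then tokenized
2017-01-02 my second article this is my second article in this website 11400
      doc1  doc2  doc3
2017   *     *     *
01     *     *     *
02           *
03                 *
Searching _all for 2017 naturally finds all 3 documents
-------------------------------------------------------------
GET /_search?q=2017-01-01
On _all, the query string 2017-01-01 is tokenized with the same analyzer that built the inverted index
2017
01
01
Any document that contains one of these terms matches, so all 3 documents are found
----------------------------------------------------------------
GET /_search?q=post_date:2017-01-01
A date is indexed as an exact value
            doc1  doc2  doc3
2017-01-01   *
2017-01-02         *
2017-01-03               *
post_date:2017-01-01 looks up the term 2017-01-01 and finds one document, doc1
-----------------------------------------------------------
GET /_search?q=post_date:2017 is not covered here, because it relies on an optimization introduced after es 5.2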
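The first three result counts can be reproduced with a small Python sketch. It assumes a simplified analyzer that lowercases and splits on non-alphanumeric characters (a stand-in for the real analyzer, not its implementation), simulates the _all field by joining all field values, and treats post_date as one exact term. The helper names analyze, search_all, and search_exact are hypothetical. The post_date:2017 query is left out because its behavior depends on the date handling that is not covered here.

```python
import re

ordinals = {1: "first", 2: "second", 3: "third"}
docs = {
    i: {
        "post_date": f"2017-01-0{i}",
        "title": f"my {word} article",
        "content": f"this is my {word} article in this website",
        "author_id": 11400,
    }
    for i, word in ordinals.items()
}


def analyze(text):
    """Simplified full-text analysis: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", str(text).lower()) if t]


def search_all(query):
    """Full-text search on a simulated _all field: any query term may match."""
    wanted = set(analyze(query))
    hits = []
    for doc_id, doc in docs.items():
        all_field = " ".join(str(value) for value in doc.values())
        if wanted & set(analyze(all_field)):
            hits.append(doc_id)
    return hits


def search_exact(field, value):
    """Exact-value search: the whole field value is a single term."""
    return [doc_id for doc_id, doc in docs.items() if str(doc[field]) == value]


print(search_all("2017"))                       # matches every document
print(search_all("2017-01-01"))                 # still every document: each contains 2017 and 01
print(search_exact("post_date", "2017-01-01"))  # only document 1
```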
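5.1 query string analysis
The query string must be analyzed with the same analyzer that was used when the index was built
The query string treats exact value and full text differently

date: exact value
_all: full text

Say we have a document with a field whose value is: hello you and me, and we build the inverted index for it
We want to search the index that holds this document, and the search text is hello me; this search text is the query string
By default es analyzes the query string (tokenization and normalization) with the same analyzer that built the inverted index for the corresponding field; only then can the search work correctly
If dogs was converted to dog when the inverted index was built, but the search still uses dogs, nothing would be found. So at search time dogs must also become dog before it can match.
Key point: depending on its type, a field may be full text or an exact value
post_date, date: exact value
_all: full text, tokenization, normalization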
5.2 mapping data types
  1. Core data types
  string (split into text and keyword since 5.x)
  byte, short, integer, long
  float, double
  boolean
  date
  2. dynamic mapping
  true or false --> boolean
  123  --> long
  123.45  --> double
  2017-01-01 --> date
  "hello world" --> string/text
  3. Viewing the mapping
  GET /index/_mapping/type
5.3 Complex mapping data types
1. multivalue field
{ "tags": [ "tag1", "tag2" ]}
Indexed the same way as a string field; the values in one array must all have the same data type
2. empty field
null, [], [null]
3. object field
PUT /company/employee/1
{
  "address": {
"country": "china",
"province": "guangdong",
"city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}
address: object type
{
  "company": {
"mappings": {
   "employee": {
      "properties": {
      "address": {
         "properties": {
            "city": {
            "type": "text",
            "fields": {
               "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
               }
            }
            },
            "country": {
            "type": "text",
            "fields": {
               "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
               }
            }
            },
            "province": {
            "type": "text",
            "fields": {
               "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
               }
            }
            }
         }
      },
      "age": {
         "type": "long"
      },
      "join_date": {
         "type": "date"
      },
      "name": {
         "type": "text",
         "fields": {
            "keyword": {
            "type": "keyword",
            "ignore_above": 256
            }
         }
      }
      }
   }
}
  }
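}
6. Inverted index and forward index (doc values)
  Searching relies on the inverted index; sorting relies on the forward index, which lets es see each field of each document and then sort. The so-called forward index is simply doc values
  When a document is indexed, es builds the inverted index for searching and also builds the forward index, i.e. doc values, for sorting, aggregations, filtering, and similar operations
  doc values are stored on disk. If there is enough memory, the os automatically caches them in memory and performance stays high; if there is not enough memory, they are read from disk
  Inverted index

  doc1: hello world you and me
  doc2: hi, world, how are you

  word   doc1  doc2
  hello   *
  world   *     *
  you     *     *
  and     *
  me      *
  hi            *
  how           *
  are           *

  Forward index

  doc1: { "name": "jack", "age": 27 }
  doc2: { "name": "tom", "age": 30 }
  document  name  age
  doc1      jack  27
  doc2      tom   30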
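As a rough illustration of the two structures, the following Python sketch builds both from the example documents. It is only a toy model under simplifying assumptions (whitespace tokenization, in-memory dictionaries); the real on-disk formats are far more compact.

```python
from collections import defaultdict

texts = {"doc1": "hello world you and me", "doc2": "hi, world, how are you"}

# Inverted index: term -> set of documents that contain it (used for searching).
inverted = defaultdict(set)
for doc_id, text in texts.items():
    for term in text.replace(",", " ").split():
        inverted[term].add(doc_id)

records = {"doc1": {"name": "jack", "age": 27}, "doc2": {"name": "tom", "age": 30}}

# Forward index (doc values): field -> {document -> value} (used for sorting).
doc_values = defaultdict(dict)
for doc_id, record in records.items():
    for field, value in record.items():
        doc_values[field][doc_id] = value

# Search goes term -> documents; sorting goes document -> field value.
matches = sorted(inverted["world"])
by_age_desc = sorted(records, key=lambda d: doc_values["age"][d], reverse=True)
print(matches)
print(by_age_desc)
```

The lookup direction is the point: the inverted index answers "which documents contain this term", while doc values answer "what is this field's value for this document".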

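7. Analyzers
  What an analyzer does: splits text into terms and normalizes them (which improves recall)
  Given a sentence, it splits the sentence into individual words and normalizes each word (tense conversion, singular/plural conversion)
  recall: increasing the number of results a search is able to find
  character filter: preprocesses the text before tokenization; the most common cases are stripping html tags (<span>hello</span> --> hello) and & --> and (I&you --> I and you)
  tokenizer: splits text into tokens, hello you and me --> hello, you, and, me
  token filter: lowercase, stop word, synonym; dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little
  An analyzer matters a great deal: it applies all of this processing to a piece of text, and only the final result is used to build the inverted index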
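The three stages can be sketched as a tiny Python pipeline. This is a toy model, not how Elasticsearch implements analyzers: the lookup tables SYNONYMS, STOP_WORDS, and STEMS are hand-written assumptions that cover only the examples from the text, and a real stemmer or synonym filter is much more general.

```python
import re

SYNONYMS = {"mother": "mom", "small": "little"}
STOP_WORDS = {"a", "an", "the"}
STEMS = {"dogs": "dog", "liked": "like"}  # toy lookup table, not a real stemmer


def char_filter(text):
    """Strip HTML tags and expand & to 'and' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def token_filters(tokens):
    """Lowercase, drop stop words, then apply stem and synonym lookups."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        out.append(SYNONYMS.get(token, token))
    return out


def analyze(text):
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<b>Tom</b> liked the small dogs & mother"))
```

Running the same analyze function over both the indexed text and the query string is what lets a search for dogs find a document that was indexed as dog.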
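8. exact value and full text
  1. exact value
      2017-01-01 as an exact value: you must enter 2017-01-01 to find it
      If you enter only 01, nothing is found
2. full text
(1) Abbreviation vs. full name: cn vs. china
(2) Word forms: like liked likes
(3) Case: Tom vs tom
(4) Synonyms: like vs love
9. Building the index
  1. How a field is indexed
  analyzed: the value is tokenized
  not_analyzed: the value is not tokenized
  no: the field is not indexed and cannot be searched
  These three are the values of the index option on 2.x string fields; from 5.x on, a text field is analyzed, a keyword field is not, and "index": false turns indexing off
  2. Modifying a mapping
  You can only define the mapping by hand when creating the index, or add a mapping for a new field; you cannot update the mapping of an existing field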
PUT /website
{
  "mappings": {
"article": {
   "properties": {
      "author_id": {
      "type": "long"
      },
      "title": {
      "type": "text",
      "analyzer": "english"
      },
      "content": {
      "type": "text"
      },
      "post_date": {
      "type": "date"
      },
      "publisher_id": {
      "type": "keyword"
      }
   }
}
  }
}
10. search API
10.1 Basic syntax of the search API
GET /_search
{}
GET /index1,index2/type1,type2/_search
{}
GET /_search
{
  "from": 0,
  "size": 10
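}
10.2 Can a GET request carry a request body in HTTP?
  HTTP GET requests do not normally carry a request body, but because GET is a better fit for describing a data query, es uses it this way anyway
  GET /_search?from=0&size=10
  POST /_search
  {
    "from":0,
    "size":10
  }
  As it happens, many browsers and servers do support GET with a request body
  Where it is not supported, you can use POST /_search instead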
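11. Document relevance scoring: the TF&IDF algorithm
  The relevance score algorithm, simply put, computes how well a text in the index matches the search text
  Elasticsearch uses the term frequency/inverse document frequency algorithm, TF/IDF for short

  Term frequency: how many times each term of the search text appears in the field text; the more often, the more relevant
  Search request: hello world
  doc1: hello you, and world is very good
  doc2: hello, how are you
  doc1 is more relevant, because it contains both terms

  Inverse document frequency: how often each term of the search text appears across all documents in the index; the more often, the less relevant
  Search request: hello world
  doc1: hello, today is very good
  doc2: hi world, how are you
  Say the index holds 10,000 documents; the word hello appears 1,000 times across all of them, while world appears only 100 times
  doc2 is more relevant, because it matches the rarer term world

  Field-length norm: the longer the field, the weaker the relevance
  Search request: hello world
  doc1: { "title": "hello article", "content": "babaaba 10,000 words" }
  doc2: { "title": "my article", "content": "blablabala 10,000 words, hi world" }
  hello and world appear equally often in the whole index
  doc1 is more relevant, because its match is in the shorter title field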
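The three factors can be written down as a short Python sketch. The formulas are the classic Lucene ones as described in Elasticsearch: The Definitive Guide (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), norm as 1 / sqrt(numTerms)); note that Elasticsearch 5.0 and later default to BM25, which refines the same ideas. Plugging the counts from the example in as document frequencies is an assumption made for illustration.

```python
import math


def tf(freq):
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms across the index weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: a match in a shorter field weighs more."""
    return 1 / math.sqrt(num_terms)


# 10,000 documents; hello is common, world is rare (numbers from the example).
idf_hello = idf(num_docs=10000, doc_freq=1000)
idf_world = idf(num_docs=10000, doc_freq=100)
print(idf_hello < idf_world)                  # the rarer term gets the larger weight

# A 2-term title versus a content field of about 10,000 terms.
print(field_norm(2) > field_norm(10000))      # the shorter field gets the larger norm
```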
　　

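12. Search-related parameters
  1. preference
  Determines which shards are used to execute the search
  _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3
  The bouncing results problem: two documents have the same value in the sort field, and different shard copies may order them differently. Because requests are sent round-robin to different replica shards, the order of the results on the page changes from request to request. These are bouncing results.
  A search is sent round-robin to one copy of each shard (a replica shard or the primary shard), but documents may be ordered differently on different copies
  The fix is to set preference to an arbitrary string such as the user_id, so that each user's searches always run on the same replica shards and the user no longer sees bouncing results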
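Why a fixed preference string stops the bouncing can be sketched in Python. This is a conceptual model only, not Elasticsearch's actual routing code: the function pick_copy, the use of an MD5 hash, and the copy names are all assumptions chosen for illustration.

```python
import hashlib


def pick_copy(copies, preference=None, request_no=0):
    """Choose which copy of a shard serves a search.

    Without a preference, rotate round-robin across the copies;
    with one, hash the string so the same copy is always chosen.
    """
    if preference is None:
        return copies[request_no % len(copies)]
    digest = hashlib.md5(preference.encode("utf-8")).hexdigest()
    return copies[int(digest, 16) % len(copies)]


copies = ["primary", "replica-1", "replica-2"]

# Three consecutive requests without a preference hit three different copies.
rotating = [pick_copy(copies, request_no=i) for i in range(3)]

# The same three requests with a per-user preference always hit one copy.
sticky = {pick_copy(copies, preference="user_42", request_no=i) for i in range(3)}

print(rotating)
print(len(sticky))
```

Because one user always lands on the same copy, documents with equal sort values keep the same relative order from request to request.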
　　

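  2. timeout: its mechanism was covered earlier; it limits the search to a time window and returns the partial data collected so far, so a query does not run too long
  3. routing: document routing. By default documents are routed by _id; with routing=user_id, all data of the same user goes to one shard
  4. search_type
  default: query_then_fetch
  dfs_query_then_fetch: can improve the precision of relevance sorting

13. query phase
  1. query phase
  (1) The search request is sent to some coordinate node, which builds a priority queue whose length is from + size of the paging request, 10 by default
  (2) The coordinate node forwards the request to every shard; each shard searches locally and builds its own local priority queue
  (3) Each shard returns its priority queue to the coordinate node, which merges them into a global priority queue
  2. How replica shards increase search throughput
  One request has to hit one copy (replica or primary) of every shard. If every shard has several replicas, concurrent search requests can be spread over the other replicas at the same time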
　　

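14. fetch phase

  1. fetch phase workflow
  (1) Once the coordinate node has built the priority queue, it sends mget requests to the shards to fetch the corresponding documents
  (2) Each shard returns its documents to the coordinate node
  (3) The coordinate node merges the documents and returns the result to the client
  2. A plain search without from and size returns the top 10 results, sorted by _score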
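The two phases can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: shards are plain in-memory lists of (doc_id, score) pairs with made-up scores, heapq.nlargest stands in for the priority queues, and the names query_phase and fetch_phase are hypothetical.

```python
import heapq

# Hypothetical data: shard number -> list of (doc_id, score) matches.
shards = {
    0: [("a", 0.9), ("b", 0.4), ("c", 0.1)],
    1: [("d", 0.8), ("e", 0.7)],
    2: [("f", 0.95), ("g", 0.2)],
}
# Stored documents, looked up by id during the fetch phase.
stored = {doc_id: {"_id": doc_id} for hits in shards.values() for doc_id, _ in hits}


def query_phase(from_, size):
    """Each shard returns its local top (from + size); the coordinate node merges them."""
    k = from_ + size
    candidates = []
    for hits in shards.values():
        local_top = heapq.nlargest(k, hits, key=lambda hit: hit[1])
        candidates.extend(local_top)
    global_top = heapq.nlargest(k, candidates, key=lambda hit: hit[1])
    return global_top[from_:from_ + size]  # only ids and scores so far


def fetch_phase(page):
    """Fetch the full documents only for the ids that made it onto the page."""
    return [dict(stored[doc_id], _score=score) for doc_id, score in page]


page = fetch_phase(query_phase(from_=2, size=2))
print(page)
```

Only ids and scores travel in the query phase; the full documents are fetched afterwards, and only for the entries on the requested page.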
　　
