InnoDB Full-Text Index Basics

4321pp posted on 2017-8-29 09:47:09

Full-text indexes:

Official documentation:
https://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html

References:
http://blog.csdn.net/u011734144/article/details/52817766
http://www.cnblogs.com/olinux/p/5169282.html

Full-text search is usually implemented with an inverted index.
For details, see pages 231-248 of Jiang's InnoDB storage engine book (2nd edition).

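InnoDB has supported full-text indexes since version 1.2.x, using a full inverted index. InnoDB treats each (DocumentID, Position) pair as an ilist. A full-text table therefore has two columns, a word column and an ilist column, with an index on the word column. Because InnoDB stores Position information in the ilist column, it can perform Proximity Search, a feature that MyISAM does not support.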
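To make the (DocumentID, Position) idea concrete, here is a minimal Python sketch of a full inverted index. It is only a toy model and not how InnoDB stores anything internally: the function name `build_inverted_index`, the regex tokenizer, and the use of character offsets as positions are all assumptions made for illustration, and stopwords and token-length limits are ignored.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to its ilist: a list of (doc_id, position) pairs.

    position is the character offset of the word inside the document.
    This toy version indexes every alphabetic token, with no stopword
    filtering and no minimum or maximum token length.
    """
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for match in re.finditer(r"[a-z]+", text.lower()):
            index[match.group()].append((doc_id, match.start()))
    return dict(index)


# Three of the sample rows used later in this post.
docs = {
    2: "pease porridge hot,pease porridge cold",
    4: "Some like it hot,some like it cold",
    7: "I like code days",
}
index = build_inverted_index(docs)
print(index["cold"])  # [(2, 34), (4, 30)]
print(index["hot"])   # [(2, 15), (4, 13)]
print(index["code"])  # [(7, 7)]
```

Since every entry carries a position, a proximity check only has to compare the positions of two words within the same document.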

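As noted above, an inverted index stores its words in a table, which is called an Auxiliary Table. To improve the concurrency of full-text search, InnoDB uses 6 Auxiliary Tables, and words are partitioned across them according to their Latin encoding.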

Auxiliary Tables are persistent tables stored on disk. InnoDB's full-text index also has another important concept, the FTS Index Cache, which is used to improve full-text search performance.

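The FTS Index Cache is a red-black tree sorted by (word, ilist). Because of this cache, inserted data may already be in the base table while the corresponding full-text index updates produced by tokenization are still in the FTS Index Cache and have not yet reached the Auxiliary Tables. InnoDB updates the Auxiliary Tables in batches rather than once per insert. When a full-text query runs, the matching words in the FTS Index Cache are first merged into the Auxiliary Table, and then the query is executed. This merge is very similar to InnoDB's Insert Buffer. The difference is that the Insert Buffer is a persistent object with a B+ tree structure. The FTS Index Cache nevertheless plays a similar role: it improves InnoDB performance, and because entries are sorted by the red-black tree and inserted in batches, the resulting Auxiliary Tables are relatively small.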
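The cache-then-batch behaviour can be sketched roughly as follows. This is a toy Python model under stated assumptions, not InnoDB's implementation: the class `ToyFtsIndex`, the `cache_limit` counter, and the plain dictionaries standing in for the red-black tree and the on-disk tables are all hypothetical. For simplicity the lookup reads both structures instead of physically merging the cache first.

```python
from collections import defaultdict


class ToyFtsIndex:
    """Toy model: an in-memory cache in front of a persistent auxiliary table."""

    def __init__(self, cache_limit=4):
        self.cache = defaultdict(list)      # stands in for the FTS Index Cache
        self.aux_table = defaultdict(list)  # stands in for the Auxiliary Tables
        self.cache_limit = cache_limit      # cached entries allowed before a flush
        self.flushes = 0

    def add(self, word, doc_id, position):
        """Record one tokenized word; flush in one batch when the cache is full."""
        self.cache[word].append((doc_id, position))
        if sum(len(entries) for entries in self.cache.values()) >= self.cache_limit:
            self.flush()

    def flush(self):
        """Write all cached entries to the auxiliary table in sorted word order."""
        for word in sorted(self.cache):
            self.aux_table[word].extend(sorted(self.cache[word]))
        self.cache.clear()
        self.flushes += 1

    def lookup(self, word):
        """A query has to see flushed entries and entries still in the cache."""
        return sorted(self.aux_table.get(word, []) + self.cache.get(word, []))


fts = ToyFtsIndex(cache_limit=4)
for doc_id, pos, word in [(3, 0, "nine"), (3, 5, "days"), (3, 10, "old"),
                          (6, 0, "nine"), (6, 5, "days")]:
    fts.add(word, doc_id, pos)

print(fts.flushes)         # 1: five inserts caused a single batch write
print(fts.lookup("days"))  # [(3, 5), (6, 5)]
```

The second "days" entry is still only in the cache at the end, yet the lookup returns it, which is the point of merging before querying.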

InnoDB lets users inspect the tokens stored in the Auxiliary Tables of a given inverted index by setting innodb_ft_aux_table. The following SQL statement selects the Auxiliary Tables of table fts_a in the test schema:

SET GLOBAL innodb_ft_aux_table='test/fts_a';
Once this is set, the tokens of table fts_a can be read from the table INNODB_FT_INDEX_TABLE in the information_schema schema.

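InnoDB always writes tokens to the FTS Index Cache at transaction commit and then writes them to disk in batches. Although this deferred, batched write improves database performance, these operations happen only when a transaction commits.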

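When the database shuts down, the data in the FTS Index Cache is synchronized to the Auxiliary Tables on disk. If the database crashes, some data in the FTS Index Cache may not have been synchronized to disk. After the next restart, when a user runs a full-text operation (query or insert) on the table, InnoDB automatically reads the unfinished documents, tokenizes them again, and puts the results back into the FTS Index Cache.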

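The parameter innodb_ft_cache_size controls the size of the FTS Index Cache. The default is given here as 32M; note that the MySQL 5.6 reference manual lists the default as 8000000 bytes. When the cache is full, its (word, ilist) entries are synchronized to the Auxiliary Tables on disk. Increasing this parameter can improve full-text search performance, but after a crash the index data that was not yet synchronized to disk may take longer to recover.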

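To support full-text search, one column must be mapped to the words. In InnoDB this column is named FTS_DOC_ID and has type BIGINT UNSIGNED NOT NULL, and InnoDB automatically adds a Unique Index named FTS_DOC_ID_INDEX on it. The storage engine does all of this by itself, but users can also add FTS_DOC_ID and the corresponding Unique Index manually when creating the table. Because the column name FTS_DOC_ID has a special meaning, the column must be created with exactly this type, otherwise an error is raised.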

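Tokens are inserted when the transaction commits. Deletes are handled differently: at commit, the records in the on-disk Auxiliary Tables are not removed, only the records in the FTS Index Cache. For records deleted from the Auxiliary Tables, the storage engine records the FTS Document ID and saves it in the DELETED auxiliary table. After setting innodb_ft_aux_table, users can query the table INNODB_FT_DELETED in the information_schema schema to see the deleted FTS Document IDs.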

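Because DML on documents does not actually remove data from the index, and instead inserts records into the corresponding DELETED table, the index keeps growing as the application runs, even though some of its data has been deleted and queries will never select those records. For this reason InnoDB provides a way to manually purge deleted records from the index for good: OPTIMIZE TABLE. Because OPTIMIZE TABLE also does other work, such as recalculating Cardinality, users who only want to operate on the inverted index can set innodb_optimize_fulltext_only: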
SET GLOBAL innodb_optimize_fulltext_only=1;
OPTIMIZE TABLE fts_a;

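If many documents have been deleted, OPTIMIZE TABLE can take a very long time, which hurts application concurrency and greatly increases user response times. The parameter innodb_ft_num_word_optimize limits how many words are actually purged per run; the default is 2000.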
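The deferred-delete scheme and the per-run word limit can be modelled in a few lines. The sketch below is a simplified Python illustration, not InnoDB code: the class `ToyFtsDeletes`, its method names, and the `max_words` argument (which loosely plays the role of innodb_ft_num_word_optimize) are assumptions made for this example.

```python
class ToyFtsDeletes:
    """Toy model of deferred deletion in a full-text index."""

    def __init__(self, index):
        self.index = {word: list(ilist) for word, ilist in index.items()}
        self.deleted = set()  # stands in for the DELETED auxiliary table

    def delete_doc(self, doc_id):
        """Deleting a document only records its ID; index entries stay put."""
        self.deleted.add(doc_id)

    def search(self, word):
        """Queries skip entries whose document ID is marked as deleted."""
        return sorted({d for d, _ in self.index.get(word, [])
                       if d not in self.deleted})

    def optimize(self, max_words=2000):
        """Purge deleted entries from at most max_words words per call.

        Returns True when no word is left that still needs purging.
        """
        dirty = [w for w in sorted(self.index)
                 if any(d in self.deleted for d, _ in self.index[w])]
        for word in dirty[:max_words]:
            kept = [(d, p) for d, p in self.index[word] if d not in self.deleted]
            if kept:
                self.index[word] = kept
            else:
                del self.index[word]
        return len(dirty) <= max_words


idx = ToyFtsDeletes({
    "code": [(7, 7)],
    "days": [(3, 5), (6, 5), (7, 12)],
    "like": [(4, 5), (7, 2)],
})
idx.delete_doc(7)
print(idx.search("days"))         # [3, 6]
print(len(idx.index["days"]))     # 3: the entry for document 7 is still stored
print(idx.optimize(max_words=2))  # False: one word still needs purging
print(idx.optimize(max_words=2))  # True
print(idx.index)                  # {'days': [(3, 5), (6, 5)], 'like': [(4, 5)]}
```

With a small limit the purge is spread over several runs, so each run stays short at the cost of needing more runs.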

Example:
> use test;
> CREATE TABLE fts_a(
FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
body TEXT,
PRIMARY KEY(FTS_DOC_ID)
);

INSERT INTO fts_a SELECT NULL,'pease porridge in the pot';
INSERT INTO fts_a SELECT NULL,'pease porridge hot,pease porridge cold';
INSERT INTO fts_a SELECT NULL,'Nine days old';
INSERT INTO fts_a SELECT NULL,'Some like it hot,some like it cold';
INSERT INTO fts_a SELECT NULL,'Some like it the pot';
INSERT INTO fts_a SELECT NULL,'Nine days old';
INSERT INTO fts_a SELECT NULL,'I like code days';

CREATE FULLTEXT INDEX idx_fts ON fts_a(body);

View the data:
> select * from fts_a;
+------------+----------------------------------------+
| FTS_DOC_ID | body                               |
+------------+----------------------------------------+
|       1 | pease porridge in the pot          |
|       2 | pease porridge hot,pease porridge cold |
|       3 | Nine days old                      |
|       4 | Some like it hot,some like it cold |
|       5 | Some like it the pot                |
|       6 | Nine days old                      |
|       7 | I like code days                   |
+------------+----------------------------------------+
7 rows in set (0.00 sec)

> set global innodb_ft_aux_table='test/fts_a';
Query OK, 0 rows affected (0.00 sec)
> SELECT * FROM information_schema.`INNODB_FT_INDEX_TABLE`;
+----------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| code |          7 |       7 |       1 |    7 |    7 |
| cold |          2 |       4 |       2 |    2 |    34 |
| cold |          2 |       4 |       2 |    4 |    30 |
| days |          3 |       7 |       3 |    3 |    5 |
| days |          3 |       7 |       3 |    6 |    5 |
| days |          3 |       7 |       3 |    7 |    12 |
| hot    |          2 |       4 |       2 |    2 |    15 |
| hot    |          2 |       4 |       2 |    4 |    13 |
| like |          4 |       7 |       3 |    4 |    5 |
| like |          4 |       7 |       3 |    4 |    17 |
| like |          4 |       7 |       3 |    5 |    5 |
| like |          4 |       7 |       3 |    7 |    2 |
| nine |          3 |       6 |       2 |    3 |    0 |
| nine |          3 |       6 |       2 |    6 |    0 |
| old    |          3 |       6 |       2 |    3 |    10 |
| old    |          3 |       6 |       2 |    6 |    10 |
| pease |          1 |       2 |       2 |    1 |    0 |
| pease |          1 |       2 |       2 |    2 |    0 |
| pease |          1 |       2 |       2 |    2 |    19 |
| porridge |          1 |       2 |       2 |    1 |    6 |
| porridge |          1 |       2 |       2 |    2 |    6 |
| porridge |          1 |       2 |       2 |    2 |    19 |
| post |          1 |       1 |       1 |    1 |    22 |
| pot    |          5 |       5 |       1 |    5 |    17 |
| some |          4 |       5 |       2 |    4 |    0 |
| some |          4 |       5 |       2 |    4 |    17 |
| some |          4 |       5 |       2 |    5 |    0 |
+----------+--------------+-------------+-----------+--------+----------+

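Each word is associated with a DOC_ID and a POSITION. In addition, FIRST_DOC_ID, LAST_DOC_ID, and DOC_COUNT give the ID of the first document in which the word appears, the ID of the last document in which it appears, and the number of documents that contain it.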

Running the following SQL statement now deletes the document whose FTS_DOC_ID is 7:
> DELETE FROM fts_a WHERE FTS_DOC_ID=7;

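InnoDB does not remove the corresponding records from the index right away. Instead it inserts the ID of the deleted document into the DELETED table: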
> SELECT * FROM information_schema.`INNODB_FT_DELETED`;
+--------+
| DOC_ID |
+--------+
|    7 |
+--------+

To remove this document's tokens from the inverted index completely, run:
> SET GLOBAL innodb_optimize_fulltext_only=1;
> OPTIMIZE TABLE fts_a;
+------------+----------+----------+----------+
| Table    | Op    | Msg_type | Msg_text |
+------------+----------+----------+----------+
| iot2.fts_a | optimize | status | OK    |
+------------+----------+----------+----------+

Verify:
> SELECT * FROM information_schema.`INNODB_FT_DELETED`;
+--------+
| DOC_ID |
+--------+
| 7 |
+--------+

> SELECT * FROM information_schema.`INNODB_FT_BEING_DELETED`;
+--------+
| DOC_ID |
+--------+
| 7 |
+--------+

> SELECT count(*) FROM information_schema.`INNODB_FT_INDEX_TABLE`; -- number of rows remaining in INNODB_FT_INDEX_TABLE
+----------+
| count(*) |
+----------+
| 24 |
+----------+

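Running OPTIMIZE TABLE removes the records completely, and the IDs of the purged documents are recorded in INNODB_FT_BEING_DELETED.
In addition, because the document with FTS_DOC_ID 7 has been deleted, this document ID cannot be inserted again; trying to do so raises an error: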
> INSERT INTO fts_a SELECT 7,'I like this days';
ERROR 182 (HY000): Invalid InnoDB FTS Doc ID

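The stopword list is the last concept covered in this section. Words on this list are not tokenized or indexed. For example, the word "the" carries no specific meaning, so it is treated as a stopword. InnoDB has a default stopword list, stored in the information_schema schema as the table INNODB_FT_DEFAULT_STOPWORD, which contains 36 stopwords by default. A custom stopword list can be defined with the parameter innodb_ft_server_stopword_table, for example:
> CREATE TABLE test.user_stopword (value VARCHAR(30) NOT NULL DEFAULT '' ) ENGINE=INNODB DEFAULT CHARSET=utf8; # the charset must be utf8 here, otherwise you will run into a bug
> SET GLOBAL innodb_ft_server_stopword_table='test/user_stopword';
In that case,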

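Full-text search also has the following limitations:
1 Each table can have only one full-text index.
2 A full-text index built on multiple columns requires all columns to use the same character set and collation.
3 Languages without word delimiters, such as Chinese, Japanese, and Korean, are not supported.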

Full-text search syntax:
MATCH (col1,col2,...) AGAINST (expr )

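MATCH specifies the columns to search. AGAINST specifies the expression to search for and the search mode to use.

There are 3 query modes: Natural Language, Boolean, and Query Expansion.

1. Natural Language (the default full-text query mode)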

test> SELECT * FROM fts_a WHERE MATCH(body) AGAINST ('Porridge' in natural language mode);
+--------------+----------------------------------------------+
| FTS_DOC_ID | body                         |
|--------------+----------------------------------------------|
|          2 | pease porridge hot,pease porridge cold |
|          1 | pease porridge in the post       |
+--------------+----------------------------------------------+

Conventional query:
test> explain extended SELECT * from fts_a where body like '%Porridge%';
+------+---------------+---------+--------+-----------------+--------+-----------+--------+--------+------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra    |
|------+---------------+---------+--------+-----------------+--------+-----------+--------+--------+------------+-------------|
| 1 | SIMPLE    | fts_a | ALL |       <null> | <null> | <null> | <null> |    6 |    100 | Using where |
+------+---------------+---------+--------+-----------------+--------+-----------+--------+--------+------------+-------------+

Full-text index query:
test> explain extended SELECT * from fts_a where match(body) against ('Porridge' in natural language mode);
+------+---------------+---------+----------+-----------------+---------+-----------+--------+--------+------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra    |
|------+---------------+---------+----------+-----------------+---------+-----------+--------+--------+------------+-------------|
| 1 | SIMPLE    | fts_a | fulltext | idx_fts       | idx_fts |       0 | <null> |    1 |    100 | Using where |
+------+---------------+---------+----------+-----------------+---------+-----------+--------+--------+------------+-------------+
The plan shows that the full-text index is used.

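When the MATCH function is used in the WHERE clause, the rows are returned in order of relevance, with the most relevant result first. A relevance of 0 means no relevance at all.
Relevance is calculated from the following 4 factors:
1. Whether the word appears in the document
2. How many times the word appears in the document
3. How many times the word appears in the indexed column
4. How many documents contain the word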

In the SELECT query above, porridge appears twice in document 2, so that document is listed first.

SQL statement that shows the relevance values:
test> SELECT fts_doc_id,body , match(body) AGAINST ('Porridge' in natural language mode) as Relevance from fts_a ORDER BY Relevance DESC;
+--------------+-------------------------------------------------------------+-------------+
| fts_doc_id | body                                                    | Relevance |
|--------------+-------------------------------------------------------------+-------------|
|       29 | i am porridge ,and you is porridge,and we are all porridge. | 0.271857|
|          2 | pease porridge hot,pease porridge cold                   | 0.181238|
|          1 | pease porridge in the post                               | 0.0906191 |
|          3 | Nine days old                                           | 0       |
|          6 | Nine days old                                           | 0       |
|       27 | I like this days                                        | 0       |
|       28 | hello world                                              | 0       |
+--------------+-------------------------------------------------------------+-------------+
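The Relevance values above can be reproduced with the TF-IDF style formula that the MySQL manual gives for InnoDB natural language ranking: rank = TF * IDF * IDF, with IDF = log10(total_records / matching_records). The Python sketch below is only a back-of-the-envelope check. The function name `innodb_style_rank` is made up, and the counts of 6 total records and 3 matching records are an assumption: the output does not show which counts InnoDB actually used, only that their ratio of 2 matches the numbers.

```python
import math


def innodb_style_rank(tf, total_records, matching_records):
    """rank = TF * IDF * IDF, where IDF = log10(total_records / matching_records)."""
    idf = math.log10(total_records / matching_records)
    return tf * idf * idf


# Assumed counts: 6 records in total, 3 of them containing "porridge".
for tf in (3, 2, 1):
    print(tf, round(innodb_style_rank(tf, 6, 3), 6))
# 3 0.271857
# 2 0.181238
# 1 0.090619
```

The three values line up with documents 29, 2, and 1, where porridge appears 3 times, 2 times, and once.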

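Full-text search also has to take the following factors into account:
If the searched word is in the stopword list, the search for that string is ignored.
The length of the searched word must fall within the allowed range (3 to 84 characters by default, controlled by innodb_ft_min_token_size and innodb_ft_max_token_size).

For example:
test> SELECT fts_doc_id,body , match(body) AGAINST ('the' in natural language mode) as Relevance from fts_a ORDER BY Relevance desc ;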
+--------------+-------------------------------------------------------------+-------------+
| fts_doc_id | body                                                    | Relevance |
|--------------+-------------------------------------------------------------+-------------|
|          1 | pease porridge in the post                               |       0 |
|          2 | pease porridge hot,pease porridge cold                   |       0 |
|          3 | Nine days old                                           |       0 |
|          6 | Nine days old                                           |       0 |
|       27 | I like this days                                        |       0 |
|       28 | hello world                                              |       0 |
+--------------+-------------------------------------------------------------+-------------+
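Why 'the' scores 0 everywhere is easier to see with a small model of the two filters. This Python sketch is an illustration only: `indexable_tokens` and the regex tokenizer are hypothetical, and `STOPWORDS` is a small hand-picked subset of the default list rather than all 36 entries.

```python
import re

# A few entries from the default stopword list (the full list has 36 words).
STOPWORDS = {"a", "i", "in", "is", "it", "the"}

# Default token length bounds mentioned in the text.
MIN_TOKEN, MAX_TOKEN = 3, 84


def indexable_tokens(text, stopwords=STOPWORDS):
    """Return the words that would survive stopword and length filtering."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if t not in stopwords and MIN_TOKEN <= len(t) <= MAX_TOKEN]


print(indexable_tokens("pease porridge in the pot"))  # ['pease', 'porridge', 'pot']
print(indexable_tokens("the"))                        # []
```

A search term that is filtered out this way never reaches the index, so no document can match it.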

2. Boolean

test> SELECT * from fts_a where match(body) against('+Pease -hot' in boolean mode);
+--------------+----------------------------+
| FTS_DOC_ID | body                   |
|--------------+----------------------------|
|          1 | pease porridge in the post |
+--------------+----------------------------+

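Boolean full-text search supports the following operators:
Official documentation: https://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html
1 + means the word must be present.
2 - means the word must not be present.
3 (no operator) means the word is optional, but relevance is higher if it is present.
4 @distance tests whether the searched words lie within distance of each other, where distance is measured in words. This kind of full-text query is also called Proximity Search.
For example, MATCH(body) AGAINST('"Pease pot"@20' IN BOOLEAN MODE) requires the strings Pease and pot to be within 20 words of each other.
5 > increases relevance when the word is present.
6 < decreases relevance when the word is present.
7 ~ allows the word to be present, but its contribution to relevance is negative.
8 * matches words that begin with the given prefix; for example, lik* matches lik, like, likes, and so on.
9 " denotes a phrase.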
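The +, - and no-operator rules can be expressed as simple set tests. The following Python sketch is a rough model under stated assumptions, not MySQL's parser: the function `boolean_match` is hypothetical, it only filters documents without computing relevance, and it ignores the other operators (@distance, >, <, ~, *, and quoted phrases) as well as stopwords.

```python
import re


def boolean_match(query, docs):
    """Return IDs of documents matching a query of +word, -word and bare words."""
    required, excluded, optional = set(), set(), set()
    for term in query.lower().split():
        if term.startswith("+"):
            required.add(term[1:])
        elif term.startswith("-"):
            excluded.add(term[1:])
        else:
            optional.add(term)

    hits = []
    for doc_id, text in sorted(docs.items()):
        words = set(re.findall(r"[a-z]+", text.lower()))
        # Every +word must be present and no -word may be present.
        if not required <= words or excluded & words:
            continue
        # Without any +word, at least one bare word has to appear.
        if not required and not (optional & words):
            continue
        hits.append(doc_id)
    return hits


docs = {
    1: "pease porridge in the pot",
    2: "pease porridge hot,pease porridge cold",
    3: "Nine days old",
}
print(boolean_match("+pease -hot", docs))  # [1]
print(boolean_match("+pease +hot", docs))  # [2]
print(boolean_match("pease hot", docs))    # [1, 2]
```

In this model a bare word next to a +word does not filter anything; in MySQL it would only raise the relevance of rows that contain it.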

Examples:
Find documents that contain both Pease and hot:
test> SELECT * from fts_a where match(body) against('+Pease +hot' in boolean mode);
+--------------+----------------------------------------+
| FTS_DOC_ID | body                               |
|--------------+----------------------------------------|
|          2 | pease porridge hot,pease porridge cold |
+--------------+----------------------------------------+

Find documents that contain Pease or hot:
test> SELECT * from fts_a where match(body) against('Pease hot' in boolean mode);
+--------------+----------------------------------------+
| FTS_DOC_ID | body                               |
|--------------+----------------------------------------|
|          2 | pease porridge hot,pease porridge cold |
|          1 | pease porridge in the post          |
+--------------+----------------------------------------+

Find documents in which the 2 words are no more than 8 apart:
test> SELECT * from fts_a where match(body) against('"lirulei days" @8' in boolean mode);
+--------------+-----------------------------------------------------+
| FTS_DOC_ID | body                                              |
|--------------+-----------------------------------------------------|
|       31 | i am lirulei, and happy qixi days                |
+--------------+-----------------------------------------------------+

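Compute relevance based on whether the word like or pot is present, and raise the relevance of documents that contain pot. (Document 4 contains like twice but does not contain pot, so its relevance is lower than that of documents 1 and 5.)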
test> SELECT fts_doc_id, body,match(body) against('like >pot' in boolean mode) as Relevance from fts_a ORDER BY Relevance desc ;
+--------------+----------------------------------------+-------------+
| fts_doc_id | body                               | Relevance |
|--------------+----------------------------------------+-------------|
|          5 | Some like it the pot                | 1.43142|
|          1 | pease porridge in the pot          | 1.29601|
|          4 | Some like it hot,some like it cold | 0.270814 |
|          7 | I like code days                   | 0.135407 |
|          2 | pease porridge hot,pease porridge cold | 0    |
|          3 | Nine days old                      | 0    |
|          6 | Nine days old                      | 0    |
+--------------+----------------------------------------+-------------+

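A similar query that raises the relevance of documents containing hot and adds a condition lowering the weight of documents containing some: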
test> SELECT fts_doc_id, body,match(body) against('like >hot <some' in boolean mode) as Relevance from fts_a ORDER BY Relevance desc ;
+--------------+----------------------------------------+-------------+
| fts_doc_id | body                               | Relevance |
|--------------+----------------------------------------+-------------|
|          2 | pease porridge hot,pease porridge cold | 1.29601|
|          4 | Some like it hot,some like it cold | 1.15884|
|          7 | I like code days                   | 0.135407 |
|          1 | pease porridge in the pot          | 0    |
|          3 | Nine days old                      | 0    |
|          6 | Nine days old                      | 0    |
|          5 | Some like it the pot                | -0.568583 |
+--------------+----------------------------------------+-------------+

Find documents containing words that start with cod:
test> SELECT fts_doc_id, body,match(body) against('cod*' in boolean mode) as Relevance from fts_a ORDER BY Relevance desc ;
+--------------+----------------------------------------+-------------+
| fts_doc_id | body                               | Relevance |
|--------------+----------------------------------------+-------------|
|          7 | I like code days                   | 0.714191 |
|          1 | pease porridge in the pot          | 0    |
|          2 | pease porridge hot,pease porridge cold | 0    |
|          3 | Nine days old                      | 0    |
|          4 | Some like it hot,some like it cold | 0    |
|          5 | Some like it the pot                | 0    |
|          6 | Nine days old                      | 0    |
+--------------+----------------------------------------+-------------+

Phrase search (note that the double quotes are required, otherwise the result is not correct):
test> SELECT * from fts_a where match(body) against('"days old"' in boolean mode);
+--------------+---------------+
| FTS_DOC_ID | body       |
|--------------+---------------|
|          3 | Nine days old |
|          6 | Nine days old |
+--------------+---------------+

3. Query Expansion
See pages 247-248 of Jiang's book.
Official documentation: https://dev.mysql.com/doc/refman/5.6/en/fulltext-query-expansion.html
