设为首页 收藏本站
查看: 527|回复: 0

[经验分享] MySQL使用rand获取随机记录的性能优化问题

[复制链接]

尚未签到

发表于 2016-10-19 01:13:04 | 显示全部楼层 |阅读模式
  对MySQL rand方法随机获取记录的性能优化问题,讲解很到位的一篇文章:
  
If you read the MySQL manual you might have seen the ORDER BY RAND() to randomize the the rows and using the LIMIT 1 to just take one of the rows.

SELECT name
FROM random
ORDER BY RAND()
LIMIT 1;
This example works fine and is fast if you only when let’s say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don’t forget: we only sort to throw nearly all the rows away.
I never liked it. And there are better ways to do it. Without a sorting. As long as we have a numeric primary key.
For the first examples we assume the be ID is starting at 1 and we have no holes between 1 and the maximum value of the ID.

move the work into the application

First idea: We can simplify the whole job if we calculate the ID beforehand in the application.

SELECT MAX(id) FROM random;
## generate random id in application
SELECT name FROM random WHERE id = <random-id>
As MAX(id) == COUNT(id) we just generate random number between 1 and MAX(id) and pass it into the database to retrieve the random row.
The first SELECT is a NO-OP and is optimized away. The second is aeq_refagainst a constant value and also very fast.

move the job into the database

But is it really necessary to do it in the application ? Can’t we do it in the database ?

# generating a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# oops, this is a double, we need an int
> SELECT CEIL(RAND() * MAX(id)) FROM random;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
|                    1000000  |
+-------------------------+
# better. But how is the performance:
> EXPLAIN
SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type  | rows | Extra       |
+----+-------------+-------+-------+------+-------------+
|  1 | SIMPLE      | random  | index | 1000000  | Using index |
+----+-------------+-------+-------+------+-------------+
## a index scan ? we lost our optimization for the MAX()
> EXPLAIN
SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## a simple Sub-Query is bringing us the performance back.
Ok, now we know how to generate the random ID, but how to get the row ?

> EXPLAIN
SELECT name
FROM random
WHERE id = (SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+
NO, NO, NO.Don’t go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching. This leads to 0 to 4091 rows, depending on your luck.
We need a way to make sure that the random-id is only generated once:

SELECT name
FROM random JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random)) AS id
) AS r2
USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | random     | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just on single row. Perfect.
No Sorting, No Application, Most parts of the query optimized away.

adding holes to the numbers

To generalize the last solution we add the possibility of holes, like when youDELETErows.

SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The JOIN now adds all the IDs which are greater or equal than our random value and selects only the direct neighboor if a direct match is not possible. BUT as soon as one row is found, we stop (theLIMIT 1). And we read the rows according to the index (ORDER BY id ASC). As we are using>=instead of a=we can get rid of a theCEILand get the same result with a little less work.

Equal Distribution

As soon as the distribution of the IDs is not equal anymore our selection of rows isn’t really random either.

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+
TheRANDfunction is generating IDs like 9 to 15 which all lead to the id 16 to be selected as the next higher number.
There is no real solution for this problem, but your data is mostly constant you can add a mapping table which maps the row-number to the id:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+
Therow_idis now free of holes again and we can run our random query again:

SELECT name FROM holes
JOIN (SELECT r1.random_id
FROM holes_map AS r1
JOIN (SELECT (RAND() *
(SELECT MAX(row_id)
FROM holes_map)) AS row_id)
AS r2
WHERE r1.row_id >= r2.row_id
ORDER BY r1.row_id ASC
LIMIT 1) as rows ON (id = random_id);
After 1000 fetches we see a equal distribution again:

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+
Maintaining the holes

Let’s take the tables as before:

DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
id SERIAL,
name VARCHAR(32) NOT NULL UNIQUE
);
DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
id SERIAL,
r2_id bigint unsigned NOT NULL UNIQUE
);
When ever we change something inr2we want to thatr2_equi_distis updated too.

DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
AFTER INSERT ON r2 FOR EACH ROW
BEGIN
DECLARE m BIGINT UNSIGNED DEFAULT 1;
SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
SELECT IFNULL(m, 1) INTO m;
INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;
DELETE FROM r2;
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  2 | a09a3959d68187ce48f4fe7e388926a9 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     2 |
|  3 |     3 |
|  4 |     4 |
+----+-------+
INSERTis quite simple, onDELETEwe have to update the equi-dist-id to maintain the hole-free setup:

DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
AFTER DELETE ON r2 FOR EACH ROW
BEGIN
DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;
DELETE FROM r2 WHERE id = 2;
SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |     4 |
+----+-------+
UPDATEis straight-forward again. We only have to maintain the Foreign Key constraint:

DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;
UPDATE r2 SET id = 25 WHERE id = 4;
SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |    25 |
+----+-------+
Multiple Rows at once

If you want to get more than one row returned, you can:


  • execute the Query several times
  • write a stored procedure which is executing the query and stores the result in a temp-table
  • make a UNION

a stored procedure

Stored procedures provide you with the structures you know from your favourite programming language:


  • loops
  • control structures
  • procedures * …

For this task we only need a LOOP:

DELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
DROP TEMPORARY TABLE IF EXISTS rands;
CREATE TEMPORARY TABLE rands ( rand_id INT );
loop_me: LOOP
IF cnt < 1 THEN
LEAVE loop_me;
END IF;
INSERT INTO rands
SELECT r1.id
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
SET cnt = cnt - 1;
END LOOP loop_me;
END$$
DELIMITER ;
CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
|  133716 |
|  702643 |
|  112066 |
|  452400 |
+---------+
I leave to the reader to fix the issues:


  • use dynamic SQL and pass in the name of the temporary table
  • use a UNIQUE index on the table and catch the UNIQUE key violation to remove the possible duplicates in the result-set

Performance

Now let’s see what happends to our performance. We have 3 different queries for solving our problems.


  • Q1. ORDER BY RAND()
  • Q2. RAND() * MAX(ID)
  • Q3. RAND() * MAX(ID) + ORDER BY ID

Q1 is expected to cost N * log2(N), Q2 and Q3 are nearly constant.
The get real values we filled the table with N rows ( one thousand to one million) and executed each query 1000 times.

   100        1.000      10.000     100.000    1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
As you can see the plain ORDER BY RAND() is already behind the optimized query at only 100 rows in the table.
A more detailed analysis of those queries is atanalyzing-complex-queries.


原文链接:
http://jan.kneschke.de/projects/mysql/order-by-rand/

运维网声明 1、欢迎大家加入本站运维交流群:群②:261659950 群⑤:202807635 群⑦870801961 群⑧679858003
2、本站所有主题由该帖子作者发表,该帖子作者与运维网享有帖子相关版权
3、所有作品的著作权均归原作者享有,请您和我们一样尊重他人的著作权等合法权益。如果您对作品感到满意,请购买正版
4、禁止制作、复制、发布和传播具有反动、淫秽、色情、暴力、凶杀等内容的信息,一经发现立即删除。若您因此触犯法律,一切后果自负,我们对此不承担任何责任
5、所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其内容的准确性、可靠性、正当性、安全性、合法性等负责,亦不承担任何法律责任
6、所有作品仅供您个人学习、研究或欣赏,不得用于商业或者其他用途,否则,一切后果均由您自己承担,我们对此不承担任何法律责任
7、如涉及侵犯版权等问题,请您及时通知我们,我们将立即采取措施予以解决
8、联系人Email:admin@iyunv.com 网址:www.yunweiku.com

所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其承担任何法律责任,如涉及侵犯版权等问题,请您及时通知我们,我们将立即处理,联系人Email:kefu@iyunv.com,QQ:1061981298 本贴地址:https://www.yunweiku.com/thread-287940-1-1.html 上篇帖子: Mysql避免全表扫描sql查询优化 . 下篇帖子: mysql查询 分组后前 N条数据
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

扫码加入运维网微信交流群X

扫码加入运维网微信交流群

扫描二维码加入运维网微信交流群,最新一手资源尽在官方微信交流群!快快加入我们吧...

扫描微信二维码查看详情

客服E-mail:kefu@iyunv.com 客服QQ:1061981298


QQ群⑦:运维网交流群⑦ QQ群⑧:运维网交流群⑧ k8s群:运维网kubernetes交流群


提醒:禁止发布任何违反国家法律、法规的言论与图片等内容;本站内容均来自个人观点与网络等信息,非本站认同之观点.


本站大部分资源是网友从网上搜集分享而来,其版权均归原作者及其网站所有,我们尊重他人的合法权益,如有内容侵犯您的合法权益,请及时与我们联系进行核实删除!



合作伙伴: 青云cloud

快速回复 返回顶部 返回列表