MySQL使用rand获取随机记录的性能优化问题

fumingxia · 发表于 2016-10-19 01:13:04

　　对MySQL rand方法随机获取记录的性能优化问题，讲解很到位的一篇文章：
　　
If you read the MySQL manual you might have seen the ORDER BY RAND() to randomize the the rows and using the LIMIT 1 to just take one of the rows.

SELECT name
FROM random
ORDER BY RAND()
LIMIT 1;
This example works fine and is fast if you only when let’s say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don’t forget: we only sort to throw nearly all the rows away.
I never liked it. And there are better ways to do it. Without a sorting. As long as we have a numeric primary key.
For the first examples we assume the be ID is starting at 1 and we have no holes between 1 and the maximum value of the ID.

move the work into the application

First idea: We can simplify the whole job if we calculate the ID beforehand in the application.

SELECT MAX(id) FROM random;
## generate random id in application
SELECT name FROM random WHERE id = <random-id>
As MAX(id) == COUNT(id) we just generate random number between 1 and MAX(id) and pass it into the database to retrieve the random row.
The first SELECT is a NO-OP and is optimized away. The second is aeq_refagainst a constant value and also very fast.

move the job into the database

But is it really necessary to do it in the application ? Can’t we do it in the database ?

# generating a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# oops, this is a double, we need an int
> SELECT CEIL(RAND() * MAX(id)) FROM random;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
|                   1000000  |
+-------------------------+
# better. But how is the performance:
> EXPLAIN
SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type  | rows | Extra    |
+----+-------------+-------+-------+------+-------------+
|  1 | SIMPLE    | random  | index | 1000000  | Using index |
+----+-------------+-------+-------+------+-------------+
## a index scan ? we lost our optimization for the MAX()
> EXPLAIN
SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                      |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY    | NULL  | NULL | NULL | No tables used             |
|  2 | SUBQUERY | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## a simple Sub-Query is bringing us the performance back.
Ok, now we know how to generate the random ID, but how to get the row ?

> EXPLAIN
SELECT name
FROM random
WHERE id = (SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra                      |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY    | random | ALL  | NULL       | NULL | NULL | NULL | 1000000 | Using where                |
|  3 | SUBQUERY | NULL | NULL | NULL       | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                               |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+
NO, NO, NO.Don’t go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching. This leads to 0 to 4091 rows, depending on your luck.
We need a way to make sure that the random-id is only generated once:

SELECT name
FROM random JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random)) AS id
) AS r2
USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table    | type | rows | Extra                      |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY    | <derived2> | system | 1 |                            |
|  1 | PRIMARY    | random    | const  | 1 |                            |
|  2 | DERIVED    | NULL    | NULL | NULL | No tables used             |
|  3 | SUBQUERY | NULL    | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just on single row. Perfect.
No Sorting, No Application, Most parts of the query optimized away.

adding holes to the numbers

To generalize the last solution we add the possibility of holes, like when youDELETErows.

SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table    | type | rows | Extra                      |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY    | <derived2> | system | 1 |                            |
|  1 | PRIMARY    | r1       | range  |  689 | Using where                |
|  2 | DERIVED    | NULL    | NULL | NULL | No tables used             |
|  3 | SUBQUERY | NULL    | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The JOIN now adds all the IDs which are greater or equal than our random value and selects only the direct neighboor if a direct match is not possible. BUT as soon as one row is found, we stop (theLIMIT 1). And we read the rows according to the index (ORDER BY id ASC). As we are using>=instead of a=we can get rid of a theCEILand get the same result with a little less work.

Equal Distribution

As soon as the distribution of the IDs is not equal anymore our selection of rows isn’t really random either.

> select * from holes;
+----+----------------------------------+----------+
| id | name                            | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |    107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |    50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |    132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |    230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |    481 |
+----+----------------------------------+----------+
TheRANDfunction is generating IDs like 9 to 15 which all lead to the id 16 to be selected as the next higher number.
There is no real solution for this problem, but your data is mostly constant you can add a mapping table which maps the row-number to the id:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|    1 |       1 |
|    2 |       2 |
|    3 |       4 |
|    4 |       8 |
|    5 |       16 |
+--------+-----------+
Therow_idis now free of holes again and we can run our random query again:

SELECT name FROM holes
JOIN (SELECT r1.random_id
FROM holes_map AS r1
JOIN (SELECT (RAND() *
(SELECT MAX(row_id)
FROM holes_map)) AS row_id)
AS r2
WHERE r1.row_id >= r2.row_id
ORDER BY r1.row_id ASC
LIMIT 1) as rows ON (id = random_id);
After 1000 fetches we see a equal distribution again:

> select * from holes;
+----+----------------------------------+----------+
| id | name                            | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |    222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |    187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |    195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |    207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |    189 |
+----+----------------------------------+----------+
Maintaining the holes

Let’s take the tables as before:

DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
id SERIAL,
name VARCHAR(32) NOT NULL UNIQUE
);
DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
id SERIAL,
r2_id bigint unsigned NOT NULL UNIQUE
);
When ever we change something inr2we want to thatr2_equi_distis updated too.

DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
AFTER INSERT ON r2 FOR EACH ROW
BEGIN
DECLARE m BIGINT UNSIGNED DEFAULT 1;
SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
SELECT IFNULL(m, 1) INTO m;
INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;
DELETE FROM r2;
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
SELECT * FROM r2;
+----+----------------------------------+
| id | name                            |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  2 | a09a3959d68187ce48f4fe7e388926a9 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
+----+-------+
INSERTis quite simple, onDELETEwe have to update the equi-dist-id to maintain the hole-free setup:

DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
AFTER DELETE ON r2 FOR EACH ROW
BEGIN
DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;
DELETE FROM r2 WHERE id = 2;
SELECT * FROM r2;
+----+----------------------------------+
| id | name                            |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |    1 |
|  2 |    3 |
|  3 |    4 |
+----+-------+
UPDATEis straight-forward again. We only have to maintain the Foreign Key constraint:

DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;
UPDATE r2 SET id = 25 WHERE id = 4;
SELECT * FROM r2;
+----+----------------------------------+
| id | name                            |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |    1 |
|  2 |    3 |
|  3 | 25 |
+----+-------+
Multiple Rows at once

If you want to get more than one row returned, you can:

execute the Query several times
write a stored procedure which is executing the query and stores the result in a temp-table
make a UNION

a stored procedure

Stored procedures provide you with the structures you know from your favourite programming language:

loops
control structures
procedures * …

For this task we only need a LOOP:

DELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
DROP TEMPORARY TABLE IF EXISTS rands;
CREATE TEMPORARY TABLE rands ( rand_id INT );
loop_me: LOOP
IF cnt < 1 THEN
LEAVE loop_me;
END IF;
INSERT INTO rands
SELECT r1.id
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
SET cnt = cnt - 1;
END LOOP loop_me;
END$$
DELIMITER ;
CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
|  133716 |
|  702643 |
|  112066 |
|  452400 |
+---------+
I leave to the reader to fix the issues:

use dynamic SQL and pass in the name of the temporary table
use a UNIQUE index on the table and catch the UNIQUE key violation to remove the possible duplicates in the result-set

Performance

Now let’s see what happends to our performance. We have 3 different queries for solving our problems.

Q1. ORDER BY RAND()
Q2. RAND() * MAX(ID)
Q3. RAND() * MAX(ID) + ORDER BY ID

Q1 is expected to cost N * log2(N), Q2 and Q3 are nearly constant.
The get real values we filled the table with N rows ( one thousand to one million) and executed each query 1000 times.

100       1.000    10.000    100.000 1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s 0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s 0:00.637s
As you can see the plain ORDER BY RAND() is already behind the optimized query at only 100 rows in the table.
A more detailed analysis of those queries is atanalyzing-complex-queries.

原文链接：
http://jan.kneschke.de/projects/mysql/order-by-rand/

账号		自动登录	找回密码
密码			立即注册

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

c++ size_t 和 int 的区别

[经验分享] MySQL使用rand获取随机记录的性能优化问题

浏览过的版块

扫码加入运维网微信交流群