从 MySQL 数据库中获取随机数据但不重复数据

问题描述:

我有一个随机产品列表(1000 个),每个产品都有一个 ID,我正在随机提供它们.我希望这些项目不会重复.目前我正在做的是以下内容:

I have a list of random products(1000's) each with an ID and I am bringing them up randomly. I would like that the items are not repeated. Currently what I am doing is the following:

select * from products where product_id <> previous_product_id order by rand() limit 1;

我确保之后不会直接出现相同的产品.重复产品通常比我想要的更早出现(我相信我说这是 生日问题).我不知道以非重复方式获取随机数据的最有效方法是什么.我想到了一种我认为效率极低的方法:

I am ensuring that the same product will not appear directly after. A repeat product usually appears a lot sooner then I would like (I believe I am right in saying this is the birthday problem). I have no idea what is the most effective way to get random data in a non-repeating fashion. I have thought of a way that I assume is highly inefficent:

我会为用户分配一个 id(例如 foo),然后当他们看到一个项目时,将其添加到一个字符串中,该字符串将是 product_id_1 AND product_id_2 AND product_id_3 AND product_id_n.我会用 timestamp(进一步解释) 存储这些数据.

I would assign the user an id (e.g. foo) and then when they have seen an item add it to a string that would be product_id_1 AND product_id_2 AND product_id_3 AND product_id_n. I would store this data with timestamp(explained further on).

+--------------------------------------------------------------------------------------------+                                                                                          
|user_id |timestamp         | product_seen_string                                            |
|--------------------------------------------------------------------------------------------|
|foo     |01-01-14 12:00:00 |product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+--------------------------------------------------------------------------------------------+

使用这个 product_seen_string 我会继续添加到看到的产品中(我也会更新时间戳)然后在查询中我会根据 user_id进行第一次查询> 获取此字符串,然后将该返回的字符串添加到获取随机产品的主查询中,如下所示:

With this product_seen_string I would keep adding to the seen products (I would also update the timestamp) and then in the query I would do a first query based on the user_id to obtain this string and then add that returned string to the main query that obtains the random products like so:

select * from products where product_id <> product_id_1 AND product_id_2 AND product_id_3 AND product_id_n order by rand() limit 1;

我还要写到,如果没有返回任何产品,那么数据将被删除,以便循环重新开始.除了有一个每十分钟运行一次以查看时间戳是否比一个小时还早的 cron 作业,我会删除它.

I would also write into that if no products were returned then the data would be deleted so that the cycle can start again. As well as having a cron job that every ten minutes would run to see if the timestamp is older then an hour I would delete it.

我使用的脚本语言是 PHP

The scripting language I am using is PHP

选择随机行总是很棘手,没有不涉及一些妥协的完美解决方案.要么妥协性能,要么妥协甚至随机分布,要么妥协选择重复项的机会,等等.

Selecting random rows is always tricky, and there are no perfect solutions that don't involve some compromise. Either compromise performance, or compromise even random distribution, or compromise on the chance of selecting duplicates, etc.

正如@JakeGould 所提到的,任何使用 ORDER BY RAND() 的解决方案都不能很好地扩展.随着表中的行数越来越大,在文件排序中对整个表进行排序的成本越来越差.Jake 是正确的,当排序顺序是随机时,查询不能被缓存.但我不太关心这个,因为我通常会禁用查询缓存(它有其自身的可扩展性问题).

As @JakeGould mentions, any solution with ORDER BY RAND() does not scale well. As the number of rows in your table gets larger, the cost of sorting the whole table in a filesort gets worse and worse. Jake is correct that the query cannot be cached when the sort order is random. But I don't care about that so much because I usually disable the query cache anyway (it has its own scalability problems).

这是通过创建 rownum 列并分配唯一的连续值来预先随机化表中的行的解决方案:

Here's a solution to pre-randomize the rows in the table, by creating a rownum column and assigning unique consecutive values:

ALTER TABLE products ADD COLUMN rownum INT UNSIGNED, ADD KEY (rownum);
SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();

现在您可以通过索引查找获得随机行,无需排序:

Now you can get a random row by an index lookup, without sorting:

SELECT * FROM products WHERE rownum = 1;

或者你可以得到下一个随机行:

Or you can get the next random row:

SELECT * FROM products WHERE rownum = 2;

或者,您可以一次获得 10 个随机行,或者您想要的任何其他数字,没有重复:

Or you can get 10 random rows at a time, or any other number you want, with no duplicates:

SELECT * FROM products WHERE rownum BETWEEN 11 and 20;

您可以随时重新随机化:

You can re-randomize anytime you want:

SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();

执行随机排序的成本仍然很高,但现在您不必对每个 SELECT 查询都执行此操作.您可以按计划进行,最好是在非高峰时间.

It's still costly to do the random sorting, but now you don't have to do it on every SELECT query. You can do it on a schedule, hopefully at off-peak times.