PostgreSQL: OFFSET + LIMIT becomes very slow

Problem description:


I have a table tmp_drop_ids with one column, id, and 3.3 million entries. I want to iterate over the table, doing something with each batch of 200 entries. I have this code:

LIMIT = 200
for offset in xrange(0, drop_count+LIMIT, LIMIT):
    print "Making tmp table with ids %s to %s/%s" % (offset, offset+LIMIT, drop_count)
    query = """DROP TABLE IF EXISTS tmp_cur_drop_ids; CREATE TABLE tmp_cur_drop_ids AS
    SELECT id FROM tmp_drop_ids ORDER BY id OFFSET %s LIMIT %s;""" % (offset, LIMIT)
    cursor.execute(query)


At first this runs fine (~0.15 s to generate the tmp table), but it slows down intermittently; e.g. around 300k tickets it started taking 11-12 seconds to generate the tmp table, and again around 400k. It basically seems unreliable.


I will use those ids in other queries so I figured the best place to have them was in a tmp table. Is there any better way to iterate through results like this?


Use a cursor instead. Using OFFSET and LIMIT is pretty expensive, because PostgreSQL has to execute the query, then process and skip OFFSET rows. OFFSET effectively means "skip this many rows", and that skipping gets more expensive as the offset grows.

Cursor documentation


A cursor allows iterating over the result of a single query:

BEGIN;
DECLARE C CURSOR FOR SELECT * FROM big_table;
FETCH 300 FROM C; -- get 300 rows
FETCH 300 FROM C; -- get 300 rows
...
COMMIT;


You can probably use a server-side cursor without an explicit DECLARE statement, relying on psycopg's built-in support for it (see the section on server-side cursors in the psycopg documentation).
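For example, here is a minimal sketch with psycopg2 (the DSN, cursor name, and `process_all` helper are illustrative assumptions, not from the original post). Passing a name to conn.cursor() makes psycopg2 declare a server-side cursor, so fetchmany() turns into FETCH commands on the server instead of re-skipping rows:

```python
def fetch_batches(cur, batch=200):
    """Yield lists of rows, <batch> at a time, from any DB-API cursor."""
    while True:
        rows = cur.fetchmany(batch)
        if not rows:
            break
        yield rows

def process_all(dsn="dbname=mydb"):
    """Walk tmp_drop_ids in batches of 200 via a server-side cursor.

    The DSN is a placeholder; adapt it to your connection settings.
    """
    import psycopg2  # assumed available, as in the original question
    conn = psycopg2.connect(dsn)
    try:
        # Giving the cursor a name makes psycopg2 DECLARE it on the
        # server, so the 3.3M rows are never materialized client-side.
        with conn.cursor(name="drop_ids_cur") as cur:
            cur.execute("SELECT id FROM tmp_drop_ids ORDER BY id")
            for rows in fetch_batches(cur, 200):
                ids = [r[0] for r in rows]
                # ... do something with these 200 ids ...
        conn.commit()
    finally:
        conn.close()
```

Unlike the OFFSET loop, each batch here costs the same regardless of how far into the table you are, since the server cursor simply resumes where the last FETCH left off.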