PostreSQL取出每组第一条(最高)记录(6种方法 ) Select first row in each GROUP BY group? PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON () Queries Results(性能)

PostreSQL取出每组第一条(最高)记录(6种方法 )
Select first row in each GROUP BY group?
PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON ()
Queries
Results(性能)

stackflow上面的一个问题。用窗口函数比较简单,但是那些没有窗口函数的数据库怎么办?

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1
  1.  
    WITH summary AS (
  2.  
    SELECT p.id,
  3.  
    p.customer,
  4.  
    p.total,
  5.  
    ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) AS ranks
  6.  
    FROM PURCHASES p)
  7.  
    SELECT s.*
  8.  
    FROM summary s
  9.  
    WHERE s.ranks = 1

所以给出通用方法:

  1.  
    SELECT MIN(x.id), -- change to MAX if you want the highest
  2.  
    x.customer,
  3.  
    x.total
  4.  
    FROM PURCHASES x
  5.  
    JOIN (SELECT p.customer,
  6.  
    MAX(total) AS max_total
  7.  
    FROM PURCHASES p
  8.  
    GROUP BY p.customer) y ON y.customer = x.customer
  9.  
    AND y.max_total = x.total
  10.  
    GROUP BY x.customer, x.total

PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON ()

  1.  
    SELECT DISTINCT ON (customer)
  2.  
    id, customer, total
  3.  
    FROM purchases
  4.  
    ORDER BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

  1.  
    SELECT DISTINCT ON (2)
  2.  
    id, customer, total
  3.  
    FROM purchases
  4.  
    ORDER BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

  1.  
    ...
  2.  
    ORDER BY customer, total DESC NULLS LAST, id;

If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. 

--如果total可以为空,则最可能希望具有最大非空值的行。最后添加空值。具体可参照:

其实有点不明白distinct on,看前辈的博客点击打开链接。还用了IN 子查询

 DISTINCT ON ( expression [, …] )把记录根据[, …]的值进行分组,分组之后仅返回每一组的第一行。需要注意的是,如果你不指定ORDER BY子句,返回的第一条的不确定的。如果你使用了ORDER BY 子句,那么[, …]里面的值必须靠近ORDER BY子句的最左边。 

1. 当没用指定ORDER BY子句的时候返回的记录是不确定的。

  1.  
    postgres=# select distinct on(course)id,name,course,score from student;
  2.  
    id | name | course | score
  3.  
    ----+--------+--------+-------
  4.  
    10 | 周星驰 | 化学 | 83
  5.  
    8 | 周星驰 | 外语 | 88
  6.  
    2 | 周润发 | 数学 | 99
  7.  
    14 | 黎明 | 物理 | 90
  8.  
    6 | 周星驰 | 语文 | 91
  9.  
    (5 rows)

2. 获取每门课程的最高分

  1.  
    postgres=# select distinct on(course)id,name,course,score from student order by course,score desc;
  2.  
    id | name | course | score
  3.  
    ----+--------+--------+-------
  4.  
    5 | 周润发 | 化学 | 87
  5.  
    13 | 黎明 | 外语 | 95
  6.  
    2 | 周润发 | 数学 | 99
  7.  
    14 | 黎明 | 物理 | 90
  8.  
    6 | 周星驰 | 语文 | 91
  9.  
    (5 rows)

3. 如果指定ORDER BY 必须把分组的字段放在最左边

  1.  
    postgres=# select distinct on(course)id,name,course,score from student order by score desc;
  2.  
    ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
  3.  
    LINE 1: select distinct on(course)id,name,course,score from student ...

4. 获取每门课程的最高分同样可以使用IN子句来实现

  1.  
    postgres=# select * from student where(course,score) in(select course,max(score) from student group by course);
  2.  
    id | name | course | score
  3.  
    ----+--------+--------+-------
  4.  
    2 | 周润发 | 数学 | 99
  5.  
    5 | 周润发 | 化学 | 87
  6.  
    6 | 周星驰 | 语文 | 91
  7.  
    13 | 黎明 | 外语 | 95
  8.  
    14 | 黎明 | 物理 | 90
  9.  
    (5 rows)

原文还提到 在 row_number() over(), distinct on和in子句之间有一个小区别  ,主要是因为前两个方法是用行号,且行号唯一。解决办法就是用rank()窗口函数,让同成绩的行号出现重复。

下面是一位大神提供的6种方法,有些需要在PG中实现。而且这个大神的建表语句也让我学习了。

Queries

1. row_number() in CTE, (see other answer) 公用表达式

  1.  
    WITH cte AS (
  2.  
    SELECT id, customer_id, total
  3.  
    , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
  4.  
    FROM purchases
  5.  
    )
  6.  
    SELECT id, customer_id, total
  7.  
    FROM cte
  8.  
    WHERE rn = 1;

2. row_number() in subquery (my optimization) 子查询

  1.  
    SELECT id, customer_id, total
  2.  
    FROM (
  3.  
    SELECT id, customer_id, total
  4.  
    , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
  5.  
    FROM purchases
  6.  
    ) sub
  7.  
    WHERE rn = 1;

3. DISTINCT ON (see other answer

  1.  
    SELECT DISTINCT ON (customer_id)
  2.  
    id, customer_id, total
  3.  
    FROM purchases
  4.  
    ORDER BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)  递归和LATERAL

第一个查询取customer_id最小,且该id中total最大的。

PostreSQL取出每组第一条(最高)记录(6种方法 )
Select first row in each GROUP BY group?
PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON ()
Queries
Results(性能)

在FROM 或者JOIN子句的子查询里面可以关联查询FROM子句或者JOIN子句的另一边的子句或者表.

见这一篇→点击打开链接

  1.  
    WITH RECURSIVE cte AS (
  2.  
    ( -- parentheses required
  3.  
    SELECT id, customer_id, total
  4.  
    FROM purchases
  5.  
    ORDER BY customer_id, total DESC
  6.  
    LIMIT 1
  7.  
    )
  8.  
    UNION ALL
  9.  
    SELECT u.*
  10.  
    FROM cte c
  11.  
    , LATERAL (
  12.  
    SELECT id, customer_id, total
  13.  
    FROM purchases
  14.  
    WHERE customer_id > c.customer_id -- lateral reference
  15.  
    ORDER BY customer_id, total DESC
  16.  
    LIMIT 1
  17.  
    ) u
  18.  
    )
  19.  
    SELECT id, customer_id, total
  20.  
    FROM cte
  21.  
    ORDER BY customer_id;

5. customer table with LATERAL (see here)

  1.  
    SELECT l.*
  2.  
    FROM customer c
  3.  
    , LATERAL (
  4.  
    SELECT id, customer_id, total
  5.  
    FROM purchases
  6.  
    WHERE customer_id = c.customer_id -- lateral reference
  7.  
    ORDER BY total DESC
  8.  
    LIMIT 1
  9.  
    ) l;

6. array_agg() with ORDER BY (see other answer)

  1.  
    SELECT (array_agg(id ORDER BY total DESC))[1] AS id
  2.  
    , customer_id
  3.  
    , max(total) AS total
  4.  
    FROM purchases
  5.  
    GROUP BY customer_id;

PostreSQL取出每组第一条(最高)记录(6种方法 )
Select first row in each GROUP BY group?
PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON ()
Queries
Results(性能)

挺有意思的,数组中按照每组total降序,取第一个(MAX)。第一次见到分组字段不是放在第一个的。最后用MAX函数取出最大的值。 但是这里出现了customer_id的 自动升序,看了些帖子。
PostreSQL取出每组第一条(最高)记录(6种方法 )
Select first row in each GROUP BY group?
PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON ()
Queries
Results(性能)

Results(性能)

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

  1.  
    1. 273.274 ms
  2.  
    2. 194.572 ms
  3.  
    3. 111.067 ms
  4.  
    4. 92.922 ms
  5.  
    5. 37.679 ms -- winner
  6.  
    6. 189.495 ms

B. The same with Postgres 9.5

  1.  
    1. 288.006 ms
  2.  
    2. 223.032 ms
  3.  
    3. 107.074 ms
  4.  
    4. 78.032 ms
  5.  
    5. 33.944 ms -- winner
  6.  
    6. 211.540 ms

C. Same as B., but with ~ 2.3 rows per customer_id

  1.  
    1. 381.573 ms
  2.  
    2. 311.976 ms
  3.  
    3. 124.074 ms -- winner
  4.  
    4. 710.631 ms
  5.  
    5. 311.976 ms
  6.  
    6. 421.679 ms
 

参考资料:https://www.oschina.net/translate/postgresqls-powerful-new-join-type-lateral