Getting the rank of words over dates based on frequency in PostgreSQL

Question:

I have a database that stores Twitter data:

        Create Table tweet(
            ID BIGINT UNIQUE,
            user_ID BIGINT,
            created_at TIMESTAMPTZ,
            tweet TEXT
        );

I'm trying to write a query that goes through the words in tweet for all rows, gets the frequency of each word, and returns the ten most frequent words along with each word's ranking on each date.

Example:

("word1":[1,20,22,23,24,25,26,27,28,29,30,29,28,27,26,25,26,27,28,29,30,29,28,29,28,27,28,29,30,30,...],
'word2' [...])

My current query gets the top ten words, but I am having trouble getting the rankings of those words for each day.

Current query:

    SELECT word, count(*)
    FROM (
        SELECT regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'), '\s+')
        AS word
        FROM tweet
    ) t
    GROUP BY word
    ORDER BY count(*) DESC
    LIMIT 10;

Which returns:

[('vaccine', 286669),
 ('covid', 213857),
 ('yum', 141345),
 ('pfizer', 39532),
 ('people', 28960),
 ('beer', 27117),
 ('say', 24569),
 ('virus', 23682),
 ('want', 21988),
 ('foo', 19823)]

Answer:

If you want the top 10 per day, you can do:

    select *
    from (
        select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
            rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
        from tweet t
        cross join lateral regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
            '\s+'
        ) w(word)
        group by created_day, word
    ) t
    where rn <= 10
    order by created_day, rn desc;
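
The query above returns each day's own top 10, which is not quite what the question asks for: the daily rank of the ten words that are most frequent overall. One possible way to get that is sketched below. It is untested against the real data and rests on a few assumptions: the table has the tweet_clean column used in the queries above, empty tokens produced by the split should be ignored, and the CTE names (words, daily, top_words) and the daily_ranks alias are made up for illustration.

```sql
-- Sketch only: assumes tweet.tweet_clean exists, as in the queries above.
with words as (
    -- One row per word occurrence, tagged with the day it was tweeted.
    select date_trunc('day', t.created_at) as created_day, w.word
    from tweet t
    cross join lateral regexp_split_to_table(
        regexp_replace(t.tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
        '\s+'
    ) as w(word)
    where w.word <> ''  -- assumption: drop empty tokens left by the split
),
daily as (
    -- Rank every word within its day by frequency.
    select created_day, word, count(*) as cnt,
        rank() over (partition by created_day order by count(*) desc) as rn
    from words
    group by created_day, word
),
top_words as (
    -- The ten most frequent words over the whole table.
    select word
    from words
    group by word
    order by count(*) desc
    limit 10
)
-- One row per top word, with its daily ranks collected in date order.
select d.word, array_agg(d.rn order by d.created_day) as daily_ranks
from daily d
join top_words tw using (word)
group by d.word;
```

A day on which a word does not occur at all contributes no element to its array, so the arrays can differ in length. If every array needs one entry per date, the days would have to be generated separately (for example with generate_series) and left-joined in.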
