Hive通过查询获取组中前n个记录

问题描述:

我在配置单元中有以下表格

I have following table in hive

用户名,用户名,用户地址,点击次数,展示次数,页面ID,页面名称

user-id, user-name, user-address,clicks,impressions,page-id,page-name

我需要通过点击每个页面[page-id,page-name]找出前5名用户[用户名,用户名,用户地址]

I need to find out top 5 users[user-id,user-name,user-address] by clicks for each page [page-id,page-name]

据我所知,我们需要先按[page-id,page-name]进行分组,然后在每组中我想通过[clicks,impression]命令排序,然后只发出前5个用户[用户名,用户名,用户地址],但我发现很难构建查询。

I understand that we need to first group by [page-id,page-name] and within each group I want to orderby [clicks,impressions] desc and then emit only top 5 users[user-id, user-name, user-address] for each page but I am finding it difficult to construct the query.

我们如何使用HIve UDF?

How can we do this using HIve UDF ?

您可以使用此处描述的rank()UDF执行此操作: http://ragrawal.wordpress.com/2011/11/18/extract-top -n-records-in-each-group-in-hadoophive /

You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(user-id) as rank, clicks
    FROM mytable
    DISTRIBUTE BY page-id, user-id
    SORT BY page-id, user-id, clicks desc
) a 
WHERE rank < 5
ORDER BY page-id, rank