像GROUP BY AND HAVING这样的SQL

问题描述：

我想获取满足一定条件的组数。用SQL术语，我想在Elasticsearch中执行以下操作。

I want to get the counts of groups which satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.

SELECT COUNT(*) FROM
(
   SELECT
    senderResellerId,
    SUM(requestAmountValue) AS t_amount
   FROM
    transactions
   GROUP BY
    senderResellerId
   HAVING
    t_amount > 10000 ) AS dum;

到目前为止，我可以按术语汇总将senderResellerId分组。但是，当我应用过滤器时，它无法按预期工作。

So far, I could group by senderResellerId by term aggregation. But when I apply filters, it does not work as expected.

弹性请求

{
  "aggregations": {
    "reseller_sale_sum": {
      "aggs": {
        "sales": {
          "aggregations": {
            "reseller_sale": {
              "sum": {
                "field": "requestAmountValue"
              }
            }
          }, 
          "filter": {
            "range": {
              "reseller_sale": { 
                "gte": 10000
              }
            }
          }
        }
      }, 
      "terms": {
        "field": "senderResellerId", 
        "order": {
          "sales>reseller_sale": "desc"
        }, 
        "size": 5
      }
    }
  }, 
  "ext": {}, 
  "query": {  "match_all": {} }, 
  "size": 0
}

实际响应

{
  "took" : 21,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 150824,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "reseller_sale_sum" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 149609,
      "buckets" : [
        {
          "key" : "RES0000000004",
          "doc_count" : 8,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        },
        {
          "key" : "RES0000000005",
          "doc_count" : 39,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        },
        {
          "key" : "RES0000000006",
          "doc_count" : 57,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        },
        {
          "key" : "RES0000000007",
          "doc_count" : 134,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        }
          }
        }
      ]
    }
  }
}

从上面的响应中可以看到，它正在返还转售商，但是 reseller_sale 汇总结果为零。

As you can see from above response, it is returning resellers but the reseller_sale aggregation is zero in results.

更多详细信息是此处。

答

喜欢的行为

您可以使用 管道聚合 ，即 bucket选择器聚合。查询看起来像这样：

Implementation of HAVING-like behavior

You may use one of the pipeline aggregations, namely bucket selector aggregation. The query would look like this:

POST my_index/tdrs/_search
{
   "aggregations": {
      "reseller_sale_sum": {
         "aggregations": {
            "sales": {
               "sum": {
                  "field": "requestAmountValue"
               }
            },
            "max_sales": {
               "bucket_selector": {
                  "buckets_path": {
                     "var1": "sales"
                  },
                  "script": "params.var1 > 10000"
               }
            }
         },
         "terms": {
            "field": "senderResellerId",
            "order": {
               "sales": "desc"
            },
            "size": 5
         }
      }
   },
   "size": 0
}

将以下文档放入索引后：

After putting the following documents in the index:

  "hits": [
     {
        "_index": "my_index",
        "_type": "tdrs",
        "_id": "AV9Yh5F-dSw48Z0DWDys",
        "_score": 1,
        "_source": {
           "requestAmountValue": 7000,
           "senderResellerId": "ID_1"
        }
     },
     {
        "_index": "my_index",
        "_type": "tdrs",
        "_id": "AV9Yh684dSw48Z0DWDyt",
        "_score": 1,
        "_source": {
           "requestAmountValue": 5000,
           "senderResellerId": "ID_1"
        }
     },
     {
        "_index": "my_index",
        "_type": "tdrs",
        "_id": "AV9Yh8TBdSw48Z0DWDyu",
        "_score": 1,
        "_source": {
           "requestAmountValue": 1000,
           "senderResellerId": "ID_2"
        }
     }
  ]

查询的结果是：

"aggregations": {
      "reseller_sale_sum": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "ID_1",
               "doc_count": 2,
               "sales": {
                  "value": 12000
               }
            }
         ]
      }
   }

即仅那些累计销售额为> 10000 的 senderResellerId 。

I.e. only those senderResellerId whose cumulative sales are >10000.

要实现等效于 SELECT COUNT（*）FROM（... HAVING）的人，可以使用存储桶脚本聚合与求和桶聚合。尽管似乎没有直接方法可以计算 bucket_selector 实际选择了多少个存储桶，但我们可以定义 bucket_script 根据条件产生 0 或 1 和 sum_bucket 产生其和：

To implement an equivalent of SELECT COUNT(*) FROM (... HAVING) one may use a combination of bucket script aggregation with sum bucket aggregation. Though there seems to be no direct way to count how many buckets did bucket_selector actually select, we may define a bucket_script that produces 0 or 1 depending on a condition, and sum_bucket that produces its sum:

POST my_index/tdrs/_search
{
   "aggregations": {
      "reseller_sale_sum": {
         "aggregations": {
            "sales": {
               "sum": {
                  "field": "requestAmountValue"
               }
            },
            "max_sales": {
               "bucket_script": {
                  "buckets_path": {
                     "var1": "sales"
                  },
                  "script": "if (params.var1 > 10000) { 1 } else { 0 }"
               }
            }
         },
         "terms": {
            "field": "senderResellerId",
            "order": {
               "sales": "desc"
            }
         }
      },
      "max_sales_stats": {
         "sum_bucket": {
            "buckets_path": "reseller_sale_sum>max_sales"
         }
      }
   },
   "size": 0
}

输出将是：

   "aggregations": {
      "reseller_sale_sum": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            ...
         ]
      },
      "max_sales_stats": {
         "value": 1
      }
   }

所需的存储区计数位于 max_sales_stats.value 。

The desired bucket count is located in max_sales_stats.value.

我必须指出两件事：

此功能是实验性的（从ES 5.6开始，它仍是实验性的，尽管它是在 2.0.0-beta1 。）

管道聚合为应用于先前聚合的结果：

管道聚合作用于其他聚合而不是
的输出

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.

这意味着 bucket_selector 聚合将被添加到输出树中。在 senderResellerId 上的条款聚合之后和结果上应用。例如，如果 senderResellerId 比项的 size 个多>汇总定义，您将不会获得所有集合中 sum（sales）> 10000 ，但仅出现在条款聚合的输出中的那些。考虑使用排序和/或设置足够的 size 参数。

This means that bucket_selector aggregation will be applied after and on the result of terms aggregation on senderResellerId. For example, if there are more senderResellerId than size of terms aggregation defines, you will not get all the ids in the collection with sum(sales) > 10000, but only those that appear in the output of terms aggregation. Consider using sorting and/or set sufficient size parameter.

这也适用于第二种情况， COUNT （）（... HAVING），它将仅计算聚合输出中实际存在的存储桶。

This also applies for the second case, COUNT() (... HAVING), which will only count those buckets that are actually present in the output of aggregation.

如果此查询过重或存储桶数量太大，请考虑去规范化您的数据或将此总和直接存储在文档中，因此您可以使用普通的 范围 查询以实现您的目标。

In case this query is too heavy or the number of buckets too big, consider denormalizing your data or store this sum directly in the document, so you can use plain range query to achieve your goal.

像GROUP BY AND HAVING这样的SQL

喜欢的行为

Implementation of HAVING-like behavior

相关推荐