Apache Beam:触发固定窗口

问题描述:

根据以下文档,据说如果您没有明确指定触发器,您将获得如下描述的行为:

According to following documentation, it is stated that if you don't explicitly specify a trigger you get behavior described below:

如果未指定,默认行为是当水印通过窗口的末尾,然后每隔一段时间再次触发数据迟到的时间.

If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late arriving data.

FixedWindow 是否也有这种行为?例如,您会假设固定窗口应该具有在水印通过窗口结束后重复触发的默认触发器,并丢弃所有延迟数据,除非明确处理延迟数据.另外我可以在源代码的哪里看到触发器的定义,例如 FixedWindow 对象?

Is this behavior true for FixedWindow as well? For example you would assume fixed window should have a default trigger of repeatedly firing after watermark passes end of window, and discard all late data unless late data is explicitly handled. Also where in the source code can I see definition of trigger for, example FixedWindow object?

最好的入门文档是 触发器windows(并遵循那里的链接).特别是,它说,即使每次延迟数据到达时都会触发默认触发器,但在默认配置中,它仍然有效地仅触发一次,丢弃延迟数据:

The best doc to start with is the guide for triggers, and windows (and following the links from there). In particular, it says that, even though the default trigger fires every time late data arrives, in default configuration it still effectively only triggers once, discarding the late data:

如果您同时使用默认窗口配置和默认触发器,默认触发器只发出一次,延迟数据被丢弃.这是因为默认窗口配置具有允许的延迟值为 0.请参阅处理延迟数据部分以了解有关修改此行为的信息.

if you are using both the default windowing configuration and the default trigger, the default trigger emits exactly once, and late data is discarded. This is because the default windowing configuration has an allowed lateness value of 0. See the Handling Late Data section for information about modifying this behavior.

详情

Beam 中的窗口概念一般包含很少的事情,包括分配窗口、处理触发器、处理延迟数据和其他一些事情.然而,这些事情是分开分配和处理的.从这里开始很快就会变得混乱.

Windowing concept in Beam in general encompasses few things, including assigning windows, handling triggers, handling late data and few other things. However these things are assigned and handled separately. It gets confusing quickly from here.

如何将元素分配给窗口由 WindowFn 处理,见这里.例如FixedWindows:链接.它基本上是唯一发生在那里(几乎)的事情.分配窗口是基于事件时间戳(有点)对元素进行分组的一种特殊情况.您可以认为逻辑类似于根据时间戳手动为元素分配自定义键,然后应用 GroupByKey.

How the elements are assigned to a window is handled by a WindowFn, see here. For example FixedWindows: link. It is basically the only thing that happens there (almost). Assigning a window is a special case of grouping the elements based on the event timestamps (kinda). You can think of the logic being similar to manually assigning custom keys to elements based on the timestamps, and then applying GroupByKey.

触发是一个相关但独立的概念.触发器(粗略地)只是用于指示 runner 何时被允许发出窗口中累积的数据的谓词().我认为这是最接近触发器的原始设计文档的内容:https://s.apache.org/光束触发器

Triggering is a related but separate concept. Triggers are (roughly) just predicates to indicate when the runner is allowed to emit the data accumulated in the window so far (source). I think this is the closest thing to the original design doc for triggers: https://s.apache.org/beam-triggers

延迟是配置的另一个相关部分,它也有些独立(link).即使触发器可能允许运行器永远发出所有迟到的数据,管道也可以设置为不允许任何迟到的数据(这是默认行为),或者只允许在有限的时间内出现迟到的数据.这会导致上述默认触发行为.是的,这令人困惑.如果可以,请避免使用任何复杂的触发和延迟,它可能不会像您预期的那样工作.

Lateness is another related part of the configuration which is also somewhat separate (link). Even though a trigger might allow the runner to emit all the late data forever, the pipeline can be set to not allow any late data (which is the default behavior), or only allow late data for some limited time. This leads to the default trigger behavior described above. Yes, this is confusing. Avoid using any complex triggering and lateness if you can, it likely won't work as you expect it to.

所以窗口类只处理分组逻辑,即什么样的元素具有相同的分组键.这些类不关心您何时想要发出累积结果.这取决于您的业务逻辑,例如您可能想要处理新到达的元素,或者您可能想要丢弃它们,它不是窗口的一部分.这意味着 FixedWindows 或其他窗口没有特殊的触发器,您可以对任何窗口使用任何触发器(即使逻辑上某些特定触发器在某些窗口的上下文中没有意义).

So the window classes only handle the grouping logic, i.e. what kind of elements have the same grouping key. These classes don't care about when you will want to emit the accumulated results. This depends on your business logic, e.g. you might want to handle newly arrived elements, or you might want to discard them, it's not part of the window. This means there's no special triggers for FixedWindows or other windows, you can use any trigger with any window (even if logically some specific trigger doesn't make sense in context of some window).

默认触发器 就是这样,只是默认设置的东西.如果它不适合您的需要,您应该分配自己的触发器.它可能不会,除了一些基本用例.

Default trigger is just that, something that is just set by default. You should assign your own trigger if it doesn't suit your needs. And it likely won't, except for some basic use cases.

更新

示例如何使用 FixedWindows 和触发器.

An example of how to use FixedWindows with triggers.