How do I make a DAG in Apache Airflow run like a simple cron job?
The Airflow scheduler has left me scratching my head for the past few days, because it backfills DAG runs even with catchup=False. My timezone-aware DAG has a start date of 13-04-2021 19:30 PST (14-04-2021 2:30 UTC) and the following configuration:
import pendulum
from airflow import DAG

# default_args is assumed to be defined earlier in the file
# define DAG and its parameters
dag = DAG(
    'backup_dag',
    default_args=default_args,
    start_date=pendulum.datetime(2021, 4, 13, 19, 30, tz='US/Pacific'),  # start_date in the US/Pacific (PST) timezone
    description='A data backup pipeline',
    schedule_interval="30 19 * * *",  # 7:30 PM every day
    catchup=False,
    is_paused_upon_creation=False,
)
This DAG runs on an edge device that is sometimes on and sometimes off. I want the DAG to run at 19:30 PST (2:30 UTC) whenever the edge device is on at that time, and not to run otherwise. The weird thing is that when I deploy the container with the DAG to the edge device, the DAG automatically starts its first run outside the scheduled time, for an interval that has already passed!
What am I missing here? I can't wrap my head around why the scheduler is doing this.
The following is my understanding after reading all the documentation; please correct me if I'm wrong.
The DAG was picked up by the scheduler at 2021-04-19T11:30:00+00:00 UTC. According to the DAG config, it should ideally run at 2021-04-20T02:30:00+00:00 UTC. All times below are in UTC.
Point in schedule    Time (UTC)                   Expected
DAG start_date       2021-04-14T02:30:00+00:00
1st run              2021-04-15T02:30:00+00:00    skip (catchup=False)
2nd run              2021-04-16T02:30:00+00:00    skip (catchup=False)
3rd run              2021-04-17T02:30:00+00:00    skip (catchup=False)
4th run              2021-04-18T02:30:00+00:00    skip (catchup=False)
5th run              2021-04-19T02:30:00+00:00    skip (catchup=False)
6th run              2021-04-20T02:30:00+00:00    should execute
So why is the 5th run taking place, for the interval 2021-04-18T02:30:00+00:00 to 2021-04-19T02:30:00+00:00, even though that interval has already passed?
I want the DAG to run only when its scheduled time arrives.
This is expected Airflow behavior:
turn catchup off. [...] When turned off, the scheduler creates a DAG run only for the latest interval.
The corresponding example in the Catchup section of the documentation is similar to yours and explains the behavior in more detail.
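To make the "latest interval" rule concrete, here is a rough sketch in plain Python. It is not Airflow's scheduler code: the helper latest_completed_interval is a hypothetical name, and it assumes a fixed 24-hour period in UTC, so it ignores daylight-saving shifts that a US/Pacific cron schedule would have. It uses the dates from the question.

```python
from datetime import datetime, timedelta, timezone


def latest_completed_interval(start_date, now, period=timedelta(days=1)):
    """Return (interval_start, interval_end) for the most recent interval
    that has fully elapsed at `now`, or None if none has finished yet."""
    if now < start_date + period:
        return None
    # Number of whole periods that have elapsed since start_date
    elapsed_periods = (now - start_date) // period
    interval_end = start_date + elapsed_periods * period
    return interval_end - period, interval_end


# Values taken from the question (all UTC)
start = datetime(2021, 4, 14, 2, 30, tzinfo=timezone.utc)
picked_up = datetime(2021, 4, 19, 11, 30, tzinfo=timezone.utc)

interval = latest_completed_interval(start, picked_up)
print(interval[0].isoformat(), "->", interval[1].isoformat())
# prints: 2021-04-18T02:30:00+00:00 -> 2021-04-19T02:30:00+00:00
```

With catchup=False the four earlier intervals are skipped, but the interval that ended at 2021-04-19T02:30 is still the latest completed one when the scheduler picks the DAG up at 11:30, so a run is created for it straight away.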
A dirty workaround I can think of is to set schedule_interval=None and trigger the DAG from cron using the CLI.
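One way this workaround could look, as a sketch only: a small script that cron calls, which shells out to the Airflow CLI. It assumes Airflow 2.x (where the command is `airflow dags trigger`), that the DAG is defined with schedule_interval=None, and that the `airflow` executable is on cron's PATH. The script name, the path, and the crontab line are hypothetical.

```python
import subprocess
import sys


def build_trigger_command(dag_id):
    """Return the Airflow 2.x CLI invocation that starts one manual run of dag_id."""
    return ["airflow", "dags", "trigger", dag_id]


def main(dag_id="backup_dag"):
    # Hypothetical crontab entry on the edge device; cron uses the host's
    # local time, so this assumes the host clock is set to US/Pacific:
    #   30 19 * * * /usr/bin/python3 /path/to/trigger_backup.py
    result = subprocess.run(build_trigger_command(dag_id), check=False)
    # Pass the CLI's exit code back so cron can report failures
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())
```

Because cron only fires while the device is on, a missed 19:30 is simply skipped, and the Airflow scheduler never creates a run on its own.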