Using Airflow to trigger job submission by submitting a batch POST to Livy and tracking the job
I want to use Airflow for orchestration of jobs, which includes running some Pig scripts, shell scripts and Spark jobs.
Mainly for the Spark jobs, I want to use Apache Livy, but I'm not sure whether that is a good idea or whether I should just run spark-submit.
And what is the best way to track a Spark job with Airflow once it has been submitted?
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (a minimal submission sketch follows the list below):
- Specifying remote master IP: requires modifying global configurations / environment variables
- Using SSHOperator: the SSH connection might break
- Using EmrAddStepsOperator: dependent on EMR
Regarding tracking:

- Livy only reports state and not progress (% completion of stages)
- If you're OK with that, you can just poll the Livy server via its REST API and keep printing the logs to the console; those will appear in the task logs in the WebUI (View Logs). A polling sketch is shown below.
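A polling loop along these lines (for example inside a PythonOperator callable) is enough to surface the batch state and the driver log in the Airflow task log. The Livy URL and batch_id are the same assumptions as in the submission sketch above:

```python
# Minimal tracking sketch: poll GET /batches/{id} until Livy reports a
# terminal state, then echo the collected driver log so it shows up in the
# Airflow task log. Raising on failure lets Airflow mark the task as failed.
import time
import requests

LIVY_URL = "http://livy-server:8998"          # assumed Livy endpoint
TERMINAL_STATES = {"success", "dead", "killed"}

def track_batch(batch_id: int, poll_interval: int = 30) -> None:
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
        print(f"Batch {batch_id} state: {state}")   # appears in task logs
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_interval)

    # Dump whatever driver log lines Livy has collected so far
    log = requests.get(f"{LIVY_URL}/batches/{batch_id}/log",
                       params={"from": 0, "size": 1000}).json()
    print("\n".join(log.get("log", [])))

    if state != "success":
        raise RuntimeError(f"Batch {batch_id} finished in state '{state}'")
```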
-
Livy
doesn't support reusingSparkSession
forPOST/batches
request - If that's imperative, you'll have to write your application code in
PySpark
and usePOST/session
requests
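If you do go down the interactive route, the flow is: create a session, wait until it becomes idle, then post statements that all run against the same SparkSession. A rough sketch, where the URLs and the PySpark snippet are purely illustrative:

```python
# Minimal sketch of the interactive route: create a PySpark session once
# (POST /sessions), then run code snippets against the shared SparkSession
# via POST /sessions/{id}/statements. Requires Python 3.8+ for the walrus
# operator; the Livy URL and the snippet are assumptions.
import json
import textwrap
import time
import requests

LIVY_URL = "http://livy-server:8998"
HEADERS = {"Content-Type": "application/json"}

# 1. Create the session and wait for it to become idle
session = requests.post(f"{LIVY_URL}/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=HEADERS).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(5)

# 2. Run a statement; every statement posted here reuses the same SparkSession
code = textwrap.dedent("""
    df = spark.range(100)
    print(df.count())
""")
stmt = requests.post(f"{session_url}/statements",
                     data=json.dumps({"code": code}),
                     headers=HEADERS).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"
while (result := requests.get(stmt_url).json())["state"] not in ("available", "error"):
    time.sleep(5)
print(result["output"])

# 3. Clean up the session when done
requests.delete(session_url)
```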
References

- How to submit Spark jobs to EMR cluster from Airflow?
- livy/examples/pi_app
- rssanders3/livy_spark_operator_python_example
Useful links
- How to submit Spark jobs to EMR cluster from Airflow?
- Remote spark-submit to YARN running on EMR