How does Apache Spark handle Python's multi-threading problem?

Question:

Because of Python's GIL, threads cannot run CPU-bound code in parallel within a single process, so my question is: how does Apache Spark utilize Python in a multi-core environment?
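
To make the premise of the question concrete, here is a minimal, self-contained sketch (plain Python, no Spark involved; the `count_down` function and the workload sizes are illustrative only) showing that threads give no speedup on CPU-bound work under the GIL, while separate processes do:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n):
    # Pure-Python CPU-bound loop; it holds the GIL the whole time.
    while n > 0:
        n -= 1

def timed(executor_cls, jobs=4, n=5_000_000):
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        list(ex.map(count_down, [n] * jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Threads serialize on the GIL; processes run on separate cores.
    print("threads:   %.2fs" % timed(ThreadPoolExecutor))
    print("processes: %.2fs" % timed(ProcessPoolExecutor))
```

On a multi-core machine the process-pool run is typically several times faster, which is the same effect Spark achieves by using processes rather than threads on the Python side.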

Multi-threading issues in Python are separate from Apache Spark's internals. Parallelism in Spark is handled inside the JVM.
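
As a rough sketch of what that means in practice (the `local[*]` master, the partition count, and the `cpu_heavy` function are assumptions for illustration): the JVM scheduler runs one task per core, and each task is executed by its own Python worker process, so no single GIL is shared across cores:

```python
from pyspark import SparkContext

def cpu_heavy(x):
    # CPU-bound work; each partition runs in its own Python worker process.
    total = 0
    for i in range(1_000_000):
        total += (x * i) % 7
    return total

if __name__ == "__main__":
    # local[*] asks the JVM scheduler for one task slot per core.
    sc = SparkContext("local[*]", "gil-demo")
    # 8 partitions -> up to 8 concurrent tasks, each handed to a separate
    # Python subprocess, so the GIL never becomes a shared bottleneck.
    print(sc.parallelize(range(8), 8).map(cpu_heavy).collect())
    sc.stop()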

The reason is that in the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
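
You can inspect this wiring from a PySpark driver. The sketch below pokes at SparkContext's private attributes (`_gateway`, `_jsc`, `_jvm`); these are implementation details and may differ across Spark versions:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "py4j-demo")

# _gateway is the Py4J JavaGateway that launched/attached to the JVM.
print(type(sc._gateway))   # <class 'py4j.java_gateway.JavaGateway'>

# _jsc is the Py4J proxy for the JavaSparkContext living inside the JVM.
print(sc._jsc)             # e.g. org.apache.spark.api.java.JavaSparkContext@...

# _jvm gives direct Py4J access to arbitrary JVM classes.
print(sc._jvm.java.lang.Runtime.getRuntime().availableProcessors())

sc.stop()
```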

Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
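
A hedged way to see the split: the first call below travels over the Py4J socket as a small control message, while `collect()` moves the actual rows back through Spark's own serialization path (pickled batches) rather than through Py4J. Again, `_jsc` is a private internal used here only for demonstration:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "control-vs-data")

# Control plane: a tiny Py4J round-trip to the JVM SparkContext.
print(sc._jsc.sc().defaultParallelism())

# Data plane: collect() triggers JVM-side execution, then streams the
# pickled results back outside of Py4J's command channel.
print(sc.parallelize(range(10)).map(lambda x: x * x).collect())

sc.stop()
```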

RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python sub-processes and communicate with them using pipes, sending the user's code and the data to be processed.
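
You can observe those worker subprocesses directly. In this sketch (run with a local master; the exact number of distinct PIDs varies with scheduling and worker reuse), each element is tagged with the PID of the Python process that computed it:

```python
import os
from pyspark import SparkContext

sc = SparkContext("local[4]", "worker-pids")

# Each partition is handed by the JVM, over a pipe, to a Python
# worker subprocess; os.getpid() reveals which one ran it.
pids = (sc.parallelize(range(8), 8)
          .map(lambda x: os.getpid())
          .distinct()
          .collect())

print(sorted(pids))  # typically several distinct worker PIDs
print(os.getpid())   # the driver's PID, which never appears above
sc.stop()
```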

PS: I'm not sure if this actually answers your question completely.