

I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (as of Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]).


Can you convert one to the other?

A Google search for "DataFrame definition" gives a good definition of a DataFrame:

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.


So, because of its tabular format, a DataFrame carries additional metadata (a schema of column names and types), which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that Spark cannot optimize, because the operations that can be performed against it are not as constrained.
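To make the contrast concrete, here is a minimal sketch of the same filter written against both APIs. It assumes Spark 2.x with spark-sql on the classpath and a local master; the `Person` case class, the object name, and the sample data are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SchemaVsBlackBox {
  // Hypothetical record type, used only for this illustration.
  final case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-vs-dataframe")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19), Person("Cid", 52))

    // DataFrame: Spark knows the column names and types (the schema).
    val df = people.toDF()
    df.printSchema()

    // The filter is a column expression the optimizer can inspect and rewrite.
    val adultsDf = df.filter($"age" > 21).select("name")
    adultsDf.explain(true) // prints the logical and physical plans

    // RDD: the filter is an arbitrary Scala closure that Spark cannot look inside.
    val rdd = spark.sparkContext.parallelize(people)
    val adultsRdd = rdd.filter(p => p.age > 21).map(_.name)

    // Both approaches yield the same names; only the DataFrame version is planned by the optimizer.
    assert(adultsDf.as[String].collect().toSet == adultsRdd.collect().toSet)

    spark.stop()
  }
}
```

The plans printed by `explain(true)` show the filter as an expression on the `age` column, whereas the RDD version only records that some function was applied.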

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
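As a rough sketch of that round trip in Scala (assuming Spark 2.x and a local master; the object name, column names, and sample data are made up for illustration), note that `toDF` only becomes available on an RDD after importing the session's implicits:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RddDataFrameRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("round-trip")
      .getOrCreate()
    import spark.implicits._ // brings toDF into scope for RDDs and Seqs

    // An RDD of tuples is "tabular": every element has the same fields.
    val pairs: RDD[(String, Int)] =
      spark.sparkContext.parallelize(Seq(("Ann", 34), ("Bob", 19)))

    // RDD -> DataFrame, naming the columns explicitly.
    val df: DataFrame = pairs.toDF("name", "age")

    // DataFrame -> RDD; the elements come back as generic Row objects.
    val rows: RDD[Row] = df.rdd

    // Fields of a Row can be read by column name.
    val names = rows.map(_.getAs[String]("name")).collect().sorted
    assert(names.sameElements(Array("Ann", "Bob")))

    spark.stop()
  }
}
```

Going back to an RDD gives you `RDD[Row]` rather than the original tuple type, so the static element type is lost in the round trip.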


In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.