Split JSON values from a CSV file and create new columns based on the JSON keys in Spark/Scala

Question:

I have data in a CSV file in the format shown below. I want to split the JSON in the Desc column and create a new column for each key. I am using Spark 2 with Scala.

+------+------------+----------------------------------+
|  id  |  Category  |           Desc                   |
+------+------------+----------------------------------+
|  201 |  MIS20     | { "Total": 200,"Defective": 21 } |
|  202 |  MIS30     | { "Total": 740,"Defective": 58 } |
+------+------------+----------------------------------+

The desired output would be:

+------+------------+---------+-------------+
|  id  |  Category  |  Total  |  Defective  |
+------+------------+---------+-------------+
|  201 |  MIS20     |  200    |   21        |
|  202 |  MIS30     |  740    |   58        |
+------+------------+---------+-------------+

Any help is highly appreciated.

Answer:

Create a schema for the inner JSON and apply it with the from_json function, as below:

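import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StructType}
// Outside spark-shell, also import spark.implicits._ for the $"..." syntax.

// Schema of the JSON object stored in the Desc column
val schema = new StructType()
  .add("Total", LongType, nullable = false)
  .add("Defective", LongType, nullable = false)

// d is the DataFrame loaded from the CSV file
d.select($"id", $"Category", from_json($"Desc", schema).as("desc"))
  .select($"id", $"Category", $"desc.*")  // expand the parsed struct into one column per key
  .show(false)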

Output:

+---+--------+-----+---------+
|id |Category|Total|Defective|
+---+--------+-----+---------+
|201|MIS20   |200  |21       |
|202|MIS30   |740  |58       |
+---+--------+-----+---------+
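
To try the same approach end to end without the CSV file, here is a minimal self-contained sketch. It makes some assumptions that are not in the answer above: it runs against a local SparkSession, it builds the two sample rows from the question in memory instead of reading them from a file, and the object name JsonSplitExample and the intermediate column name "parsed" are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical object name, used only for this illustration.
object JsonSplitExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting.
    val spark = SparkSession.builder()
      .appName("JsonSplitExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question, built in memory rather than read from the CSV file.
    val d = Seq(
      (201, "MIS20", """{ "Total": 200,"Defective": 21 }"""),
      (202, "MIS30", """{ "Total": 740,"Defective": 58 }""")
    ).toDF("id", "Category", "Desc")

    // Schema of the JSON object stored in the Desc column.
    val schema = new StructType()
      .add("Total", LongType)
      .add("Defective", LongType)

    // Parse the JSON string into a struct, then expand the struct into columns.
    val result = d
      .select($"id", $"Category", from_json($"Desc", schema).as("parsed"))
      .select($"id", $"Category", $"parsed.*")

    result.show(false)
    spark.stop()
  }
}
```

With a real CSV file, d would instead come from spark.read, and the reader options would need to be chosen so that the commas and quotes inside the Desc value are not treated as field separators.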

Hope this helps!