


I have to read a CSV file from HDFS, pad every column to a fixed width, and then store the result back in HDFS as a fixed-width file only, not in any other format such as CSV or Parquet.


If I read an input from HDFS as CSV that looks like the example below:

Name, age, phonenumber
A, 25,9900999999
B, 26,7654890234
C, 27,5643217897


Then I need to apply a fixed width to each column: the first column should have width 15, the second 3, and the third 10.


The output should look like this in HDFS:

Name      age   phonenumber           
A         25    9900999999
B         26    7654890234
C         27    5643217897


I then need to write that fixed-width data to HDFS as a fixed-width file.


You need to cast all columns to string if inferSchema was used. Zip the widths with df.columns so that you can handle this dynamically. Check this out:

scala> val df = Seq(("A", 25,9900999999L),("B", 26,7654890234L),("C", 27,5643217897L)).toDF("Name","age","phonenumber")
df: org.apache.spark.sql.DataFrame = [Name: string, age: int ... 1 more field]

scala> df.show(false)
|Name|age|phonenumber|
|A   |25 |9900999999 |
|B   |26 |7654890234 |
|C   |27 |5643217897 |

scala> val widths = Array(5,3,10)
widths: Array[Int] = Array(5, 3, 10)

scala> df.columns.zip(widths)
res235: Array[(String, Int)] = Array((Name,5), (age,3), (phonenumber,10))

scala> df.columns.zip(widths).foldLeft(df){ (acc,x) => acc.withColumn(x._1,rpad( trim(col(x._1).cast("string")),x._2," ")) }.show(false)
|Name |age|phonenumber|
|A    |25 |9900999999 |
|B    |26 |7654890234 |
|C    |27 |5643217897 |
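
One thing to keep in mind with the rpad call above: rpad also shortens any value that is longer than the requested length, so a width that is too small silently cuts data. The plain-Scala sketch below only illustrates that pad-or-truncate rule outside Spark. The object PadSketch and the helper padOrTruncate are hypothetical names made up for this illustration, and String.trim is only a rough stand-in for Spark's trim.

```scala
object PadSketch {
  // Hypothetical helper that roughly mirrors rpad(trim(value), width, " "):
  // pad with spaces on the right, then cut anything beyond `width`.
  def padOrTruncate(value: String, width: Int): String =
    value.trim.padTo(width, ' ').take(width)

  def main(args: Array[String]): Unit = {
    // Short value: padded on the right up to 5 characters.
    println("[" + padOrTruncate("A", 5) + "]")
    // Long value: "phonenumber" has 11 characters, so width 10 drops the last one.
    println("[" + padOrTruncate("phonenumber", 10) + "]")
  }
}
```

If truncation is not acceptable, it may be worth checking the maximum length of each column against its width before padding.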


scala> df.columns.zip(widths).foldLeft(df){ (acc,x) => acc.withColumn(x._1,rpad( trim(col(x._1).cast("string")),x._2,"-")) }.show(false)
|Name |age|phonenumber|
|A----|25-|9900999999 |
|B----|26-|7654890234 |
|C----|27-|5643217897 |
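
To get from the padded columns to an actual fixed-width file in HDFS, one option is to concatenate the padded columns into a single string column and write that with the text writer, which expects exactly one string column. This is a minimal sketch under some assumptions: the object name FixedWidthWriter, the helper toFixedWidth and both HDFS paths are hypothetical, and nulls are turned into empty strings so that one null value does not make the whole line null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, concat, lit, rpad, trim}

object FixedWidthWriter {
  // Hypothetical helper: pads every column to its width and joins the
  // results into one string column named "value" (one line per row).
  def toFixedWidth(df: DataFrame, widths: Seq[Int]): DataFrame = {
    require(df.columns.length == widths.length, "one width per column is required")
    val padded = df.columns.zip(widths).map { case (name, width) =>
      // Cast to string, replace null with "", trim, then pad (or cut) to `width`.
      rpad(trim(coalesce(col(name).cast("string"), lit(""))), width, " ")
    }
    df.select(concat(padded: _*).as("value"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-sketch").getOrCreate()

    // Assumed input location; header=true turns the first line into column names.
    val input = spark.read.option("header", "true").csv("hdfs:///tmp/input.csv")

    // Widths from the question: 15, 3 and 10. The output path is also assumed.
    toFixedWidth(input, Seq(15, 3, 10)).write.text("hdfs:///tmp/fixed_width_out")

    spark.stop()
  }
}
```

Note that Spark writes a directory of part files rather than a single file, and that with header=true this sketch does not write a header line to the output.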
