
Spark: How To Transform To Data Frame Data From Multiple Nested Xml Files With Attributes

How can I transform the values below from multiple nested XML files into a Spark data frame: the attribute Id0 from Level_0, and Date/Value from Level_4? Required output: +----------------+-------------+--

Solution 1:

You can use Level_0 as the rowTag, and explode the relevant arrays/structs:
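For reference, a minimal input file consistent with the column paths and output below might look like this (the exact layout is an assumption inferred from the answer, not shown in the original question):

```xml
<Level_0 Id0="Id0_value_file1">
  <Level_1>
    <Level_2>
      <Level_3>
        <Level_4>
          <Date>2021-01-01</Date>
          <Value>4_1</Value>
        </Level_4>
        <Level_4>
          <Date>2021-01-02</Date>
          <Value>4_2</Value>
        </Level_4>
      </Level_3>
    </Level_2>
  </Level_1>
</Level_0>
```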

import pyspark.sql.functions as F

# Requires the spark-xml package (com.databricks:spark-xml).
# A glob path such as 'xml_files/*.xml' loads multiple files at once.
df = spark.read.format('xml').options(rowTag="Level_0").load('line_removed.xml')

df2 = df.select(
    '_Id0',  # XML attributes are prefixed with '_' by spark-xml
    F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select(
    '_Id0',
    'Level_4.*'  # expand the Date and Value fields into columns
)

df2.show()
+---------------+----------+-----+
|           _Id0|      Date|Value|
+---------------+----------+-----+
|Id0_value_file1|2021-01-01|  4_1|
|Id0_value_file1|2021-01-02|  4_2|
+---------------+----------+-----+
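To make the flattening logic concrete without a Spark cluster, here is a pure-Python sketch of the same transformation using the standard library's `xml.etree.ElementTree`. The inline XML is an assumed example matching the answer's column paths; it is an illustration of the explode-and-project step, not the author's code:

```python
import xml.etree.ElementTree as ET

# Assumed sample input matching the nesting used in the Spark query.
xml_data = """
<Level_0 Id0="Id0_value_file1">
  <Level_1>
    <Level_2>
      <Level_3>
        <Level_4>
          <Date>2021-01-01</Date>
          <Value>4_1</Value>
        </Level_4>
        <Level_4>
          <Date>2021-01-02</Date>
          <Value>4_2</Value>
        </Level_4>
      </Level_3>
    </Level_2>
  </Level_1>
</Level_0>
"""

root = ET.fromstring(xml_data)

# One output row per Level_4 element, paired with the root's Id0 attribute:
# the same result the explode_outer + 'Level_4.*' select produces in Spark.
rows = [
    (root.get("Id0"), lvl4.findtext("Date"), lvl4.findtext("Value"))
    for lvl4 in root.iter("Level_4")
]
print(rows)
# [('Id0_value_file1', '2021-01-01', '4_1'), ('Id0_value_file1', '2021-01-02', '4_2')]
```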
