
Uploading a Custom Schema from a CSV File Using PySpark

I have a question about loading a schema onto CDSW using PySpark. I have a DataFrame created from a CSV file: data_1 = spark.read.csv('demo.csv', sep=',', header=True, i…

Solution 1:

Read with a custom schema so that you can define the exact data type you want for each column.

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField("COL1", StringType(), True),
    StructField("COL2", DecimalType(20, 10), True),
    StructField("COL3", DecimalType(20, 10), True),
])

df = spark.read.schema(schema).csv(file_path)
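
If you would rather not build StructType objects by hand, DataFrameReader.schema() also accepts a DDL-formatted string (Spark 2.3+). A minimal sketch of the same schema in that form, reusing the column names from the example above:

# Equivalent schema expressed as a DDL string
df = spark.read.schema("COL1 STRING, COL2 DECIMAL(20,10), COL3 DECIMAL(20,10)").csv(file_path)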

Solution 2:

You can load schema.csv, build an actual schema from it programmatically, and then use that schema to load the actual data.

Note: the type names in schema.csv must correspond to Spark data types (e.g. Double maps to DoubleType).

import pandas as pd
from pyspark.sql.types import *

# schema.csv
# variable,data_type
# V1,Double
# V2,String
# V3,Double
# V4,Integer

# data.csv
# V1,V2,V3,V4
# 1.2,a,3.4,5

dtypes = pd.read_csv('schema.csv').to_records(index=False).tolist()

# Look up each Spark type class by name, e.g. 'Double' -> DoubleType()
fields = [StructField(name, globals()[f'{type_name}Type']()) for name, type_name in dtypes]
schema = StructType(fields)

df = spark.read.csv('data.csv', header=True, schema=schema)

df.printSchema()
# root
#  |-- V1: double (nullable = true)
#  |-- V2: string (nullable = true)
#  |-- V3: double (nullable = true)
#  |-- V4: integer (nullable = true)

df.show()
# +---+---+---+---+
# | V1| V2| V3| V4|
# +---+---+---+---+
# |1.2|  a|3.4|  5|
# +---+---+---+---+
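
The globals() lookup is compact but fragile: it only resolves type classes that happen to be star-imported into the module namespace. A sketch of a safer variant, assuming the same schema.csv layout as above, uses an explicit name-to-type mapping:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType

# Explicit whitelist of type names that may appear in schema.csv
TYPE_MAP = {
    'Double': DoubleType(),
    'String': StringType(),
    'Integer': IntegerType(),
}

fields = [StructField(name, TYPE_MAP[type_name]) for name, type_name in dtypes]
schema = StructType(fields)

An unknown type name then fails with a clear KeyError on the offending entry, and the lookup cannot accidentally match an unrelated name in the module.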
