Uploading Custom Schema From A CSV File Using PySpark
I have a question about loading a schema onto CDSW using PySpark. I have a DataFrame created from a CSV file:

data_1 = spark.read.csv('demo.csv', sep=',', header=True, inferSchema=True)
Solution 1:
Read the file with a custom schema so you can define the exact data types you want:
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField("COL1", StringType(), True),
    StructField("COL2", DecimalType(20, 10), True),
    StructField("COL3", DecimalType(20, 10), True)
])

df = spark.read.schema(schema).csv(file_path)
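As an alternative, DataFrameReader.schema() also accepts a DDL-formatted string (supported since Spark 2.3), which is a more compact way to express the same schema; a minimal sketch using the column names from the example above:

# Same schema as above, expressed as a DDL string
df = spark.read.schema("COL1 STRING, COL2 DECIMAL(20,10), COL3 DECIMAL(20,10)").csv(file_path)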
Solution 2:
You can load schema.csv and build the actual schema programmatically, then use it to load the data.

Note: the type names in schema.csv must match Spark data type names, e.g. Double maps to DoubleType.
import pandas as pd
from pyspark.sql.types import *
# schema.csv
# variable,data_type
# V1,Double
# V2,String
# V3,Double
# V4,Integer

# data.csv
# V1,V2,V3,V4
# 1.2,a,3.4,5
dtypes = pd.read_csv('schema.csv').to_records(index=False).tolist()
fields = [StructField(dtype[0], globals()[f'{dtype[1]}Type']()) for dtype in dtypes]
schema = StructType(fields)
df = spark.read.csv('data.csv', header=True, schema=schema)
df.printSchema()
# root
#  |-- V1: double (nullable = true)
#  |-- V2: string (nullable = true)
#  |-- V3: double (nullable = true)
#  |-- V4: integer (nullable = true)
df.show()
# +---+---+---+---+
# | V1| V2| V3| V4|
# +---+---+---+---+
# |1.2|  a|3.4|  5|
# +---+---+---+---+
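If you would rather not resolve type names through globals(), an explicit mapping is a safer variant; a minimal sketch assuming the same schema.csv layout as above (TYPE_MAP is an illustrative name, not part of PySpark):

import pandas as pd
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType

# Explicit whitelist of type names allowed in schema.csv, mapped to Spark types
TYPE_MAP = {
    'Double': DoubleType(),
    'String': StringType(),
    'Integer': IntegerType(),
}

dtypes = pd.read_csv('schema.csv').to_records(index=False).tolist()
fields = [StructField(name, TYPE_MAP[type_name]) for name, type_name in dtypes]
schema = StructType(fields)

With this approach an unknown type name fails fast with a KeyError instead of resolving to an arbitrary global.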