Pyspark: How To Deal With Null Values In Python User Defined Functions
I want to use some string similarity functions that are not native to PySpark, such as the Jaro and Jaro-Winkler measures, on DataFrames. These are readily available in the Python module jellyfish.
Solution 1:
We will modify your code a little bit and it should work fine:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import jellyfish

@udf(DoubleType())
def jaro_winkler(s1, s2):
    if not all((s1, s2)):  # or: if None in (s1, s2):
        out = 0.0  # return a float so it matches the declared DoubleType
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out

def jaro_winkler_func(df, column_left, column_right):
    df = df.withColumn("test", jaro_winkler(df[column_left], df[column_right]))
    return df
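The null-safe wrapper pattern in the UDF above can be sketched in plain Python, without a Spark session. Here `similarity` is an illustrative stand-in for jellyfish.jaro_winkler so the sketch runs even when pyspark and jellyfish are not installed:

```python
def similarity(s1, s2):
    # stand-in for jellyfish.jaro_winkler (illustrative only):
    # 1.0 for an exact match, 0.5 otherwise
    return 1.0 if s1 == s2 else 0.5

def null_safe_similarity(s1, s2):
    # guard first: None (and '') would crash or confuse the real measure
    if not all((s1, s2)):
        return 0.0
    return similarity(s1, s2)

print(null_safe_similarity(None, "martha"))      # 0.0
print(null_safe_similarity("", "martha"))        # 0.0
print(null_safe_similarity("martha", "martha"))  # 1.0
```

Inside a real UDF the guard works the same way: Spark passes SQL NULL to the Python function as None, so the check must happen before calling the similarity measure.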
Depending on the expected behavior, you need to change the test:

if not all((s1, s2)): will return 0 for both null and the empty string ''.

if None in (s1, s2): will return 0 only for null.
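The difference between the two guards can be checked directly on plain Python values:

```python
# Compare the two guards on the three interesting cases:
# a None, an empty string, and a normal string.
cases = [(None, "abc"), ("", "abc"), ("abc", "abc")]
for s1, s2 in cases:
    print((s1, s2), not all((s1, s2)), None in (s1, s2))
# (None, 'abc') True True    -> both guards fire
# ('', 'abc')   True False   -> only the all() guard fires
# ('abc', 'abc') False False -> neither fires
```

So use the all() version if an empty string should also score 0, and the None version if empty strings should still be passed to the similarity measure.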