
Pyspark: How To Deal With Null Values In Python User Defined Functions

I want to use some string similarity functions that are not native to PySpark, such as the jaro and jaro-winkler measures, on dataframes. These are readily available in the Python module jellyfish.
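For example, in plain Python (a minimal sketch; depending on the jellyfish version the function may also be exposed as jaro_winkler_similarity):

import jellyfish

# Direct comparison of two strings; newer jellyfish releases expose this
# as jaro_winkler_similarity rather than jaro_winkler.
print(jellyfish.jaro_winkler("martha", "marhta"))  # roughly 0.96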

Solution 1:

We will modify your code a little bit and it should work fine:

import jellyfish
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType


@udf(DoubleType())
def jaro_winkler(s1, s2):
    if not all((s1, s2)):  # or: if None in (s1, s2):
        out = 0.0  # return a float so it matches DoubleType
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out


def jaro_winkler_func(df, column_left, column_right):
    df = df.withColumn("test", jaro_winkler(df[column_left], df[column_right]))
    return df
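
For reference, a minimal usage sketch (the SparkSession setup and the column names name_left/name_right are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small illustrative DataFrame with a null and an empty string.
df = spark.createDataFrame(
    [("martha", "marhta"), ("dwayne", None), ("", "duane")],
    ["name_left", "name_right"],
)
jaro_winkler_func(df, "name_left", "name_right").show()
# Rows containing a null (and, with the not all(...) test, an empty string)
# get 0.0 in the "test" column; the others get the jaro-winkler score.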

Depending on the expected behavior, you may need to change the test, as illustrated below:

  • if not all((s1, s2)): will return 0.0 for both null and the empty string ''.
  • if None in (s1, s2): will return 0.0 only for null values.
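
A quick plain-Python check of the two guards (illustrative values only):

pairs = [(None, "abc"), ("", "abc"), ("abc", "abd")]
for s1, s2 in pairs:
    print(s1, s2, not all((s1, s2)), None in (s1, s2))
# (None, "abc")  -> True, True  : both tests short-circuit to 0.0
# ("", "abc")    -> True, False : only not all(...) treats the empty string as missing
# ("abc", "abd") -> False, False: both tests fall through to jellyfish.jaro_winkler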
