Showing posts with the label Apache Spark

PySpark 2.1: Importing Module With UDFs Breaks Hive Connectivity

I'm currently working with Spark 2.1 and have a main script that calls a helper module that con…

PySpark Application Fails With java.lang.OutOfMemoryError: Java Heap Space

I'm running Spark via PyCharm and the pyspark shell, respectively. I'm stuck with this error:…

How To Improve The Performance Of A Merge Operation With An Incremental Delta Lake Table?

I am specifically looking to optimize performance by updating and inserting data to a DeltaLake bas…

PySpark Error With UDF: py4j.Py4JException: Method __getnewargs__([]) Does Not Exist

I am trying to solve the following error (I am using the Databricks platform and Spark 2.0) tweets_…

PySpark: Add New Column Field With The Data Frame Row Number

Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users email an…

How To Assign A String Variable To A Dataframe Name

I have a problem with a for loop like the one below: list = [1,2,3,4] for index in list: n…

Spark - Set Null When Column Does Not Exist In Dataframe

I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A,B…

What Type Should The Dense Vector Be When Using A UDF In PySpark?

I want to convert a List to a Vector in PySpark, and then use this column in a Machine Learning model for …