
Pyspark Add New Column Field With The Data Frame Row Number

Hi, I'm trying to build a recommendation system with Spark. I have a data frame with user emails and movie ratings. df = pd.DataFrame(np.array([['aa@gmail.com',2,3],['aa@gmail.com',5,5]

Solution 1:

The question "Primary keys with Apache Spark" practically answers yours, but in this particular case StringIndexer could be a better choice, because it assigns one numeric index per distinct user rather than one per row:

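from pyspark.ml.feature import StringIndexer

# sparkdf is the Spark DataFrame from the question; it must have a string column "user"
indexer = StringIndexer(inputCol="user", outputCol="user_id")
# fit() learns the label-to-index mapping, transform() appends the "user_id" column
indexed = indexer.fit(sparkdf).transform(sparkdf)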
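
To make the resulting mapping concrete, here is a minimal plain-Python sketch of what StringIndexer computes with its default ordering ("frequencyDesc": the most frequent label gets index 0). It is only an illustration, not Spark's implementation; the helper name string_index and the sample emails are hypothetical, and the alphabetical tie-breaking assumes Spark 3.0 or later.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to an index, most frequent first.

    Hypothetical helper that mimics StringIndexer's default
    "frequencyDesc" ordering; ties are broken alphabetically.
    """
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically for ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # Spark stores the index as a double, so floats are used here too.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample data, not taken from the question.
users = ["aa@gmail.com", "bb@gmail.com", "aa@gmail.com", "cc@gmail.com"]
mapping = string_index(users)
user_ids = [mapping[u] for u in users]

print(mapping)
print(user_ids)
```

Every row with the same email gets the same id, which is what a recommender needs. In Spark the user_id column comes out as a double, so it may need a cast to an integer type before being passed to a model that expects integer ids.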
