Pyspark Add New Column Field With The Data Frame Row Number
Hi, I'm trying to build a recommendation system with Spark. I have a data frame with user emails and movie ratings. df = pd.DataFrame(np.array([['aa@gmail.com',2,3],['aa@gmail.com',5,5]]))
Solution 1:
Primary keys with Apache Spark practically answers your question, but in this particular case StringIndexer could be a better choice:
from pyspark.ml.feature import StringIndexer

# Map each distinct user string to a numeric index stored in a new user_id column
indexer = StringIndexer(inputCol="user", outputCol="user_id")
indexed = indexer.fit(sparkdf).transform(sparkdf)
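For context, here is a minimal end-to-end sketch of how this could be wired up, assuming the pandas frame from the question is converted to a Spark DataFrame named sparkdf; the column names ('user', 'movie', 'rating') and the SparkSession setup are assumptions, not part of the original question:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

# Column names are assumed from the question's description of the data
pdf = pd.DataFrame(
    np.array([['aa@gmail.com', 2, 3], ['aa@gmail.com', 5, 5]]),
    columns=['user', 'movie', 'rating'])
sparkdf = spark.createDataFrame(pdf)

# StringIndexer assigns each distinct user a numeric index (0.0, 1.0, ...)
indexer = StringIndexer(inputCol="user", outputCol="user_id")
indexed = indexer.fit(sparkdf).transform(sparkdf)
indexed.show()

Because both rows share the same email, they receive the same user_id, which is usually what you want when the column is meant to act as a user key rather than a per-row number.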