How To Use Matplotlib To Plot Pyspark Sql Results

March 26, 2024 Post a Comment

I am new to pyspark. I want to plot the result using matplotlib, but not sure which function to use. I searched for a way to convert sql result to pandas and then use plot.

Solution 1:

I have found the solution for this. I converted sql dataframe to pandas dataframe and then I was able to plot the graphs. below is the sample code.from

pyspark.sql import Row
from pyspark.sql import HiveContext
import pyspark
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 
sc = pyspark.SparkContext()
sqlContext = HiveContext(sc)
test_list = [(1, 'hasan'),(2, 'nana'),(3, 'dad'),(4, 'mon')]
rdd = sc.parallelize(test_list)
people = rdd.map(lambda x: Row(id=int(x[0]), name=x[1]))
schemaPeople = sqlContext.createDataFrame(people)
# Register it as a temp table
sqlContext.registerDataFrameAsTable(schemaPeople, "test_table")
df1=sqlContext.sql("Select * from test_table")
pdf1=df1.toPandas()
pdf1.plot(kind='barh',x='name',y='id',colormap='winter_r')

Solution 2:

For small data, you can use .select() and .collect() on the pyspark DataFrame. collect will give a python list of pyspark.sql.types.Row, which can be indexed. From there you can plot using matplotlib without Pandas, however using Pandas dataframes with df.toPandas() is probably easier.

lacucinadiadine

How To Use Matplotlib To Plot Pyspark Sql Results

Solution 1:

Solution 2:

Post a Comment for "How To Use Matplotlib To Plot Pyspark Sql Results"

Widget HTML #3