Improve Speed Of Spark App
Solution 1:
First, figure out what's actually taking the most time. For example, determine how long just reading the data takes:
axes = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load()
    .count())
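To put a number on it, you can wrap the action in a simple timer. A minimal sketch using Python's standard time module (table and keyspace as above):

import time

start = time.time()

# count() is an action, so this forces a full read of the table
# through the connector.
row_count = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load()
    .count())

print("Read %d rows in %.1f seconds" % (row_count, time.time() - start))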
Increasing the parallelism or the number of parallel readers may help here, but only if you aren't already maxing out the I/O of your Cassandra cluster.
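A quick way to see how much read parallelism you are actually getting is to check the partition count of the loaded DataFrame (getNumPartitions is standard Spark; df is just a local name here):

df = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load())

# Each Spark partition becomes one read task, so this is the
# number of parallel readers you get.
print(df.rdd.getNumPartitions())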
Second, see if you can do everything with the DataFrames API. Every time you use a Python lambda you incur serialization costs between the Python and Scala types.
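For example, suppose you want to double a numeric column (df and the column name value are hypothetical here). The RDD version ships every row to a Python worker, while the DataFrame version stays entirely in the JVM:

from pyspark.sql import functions as F

# Slow: each row is serialized out to a Python process to run the lambda.
doubled_rdd = df.rdd.map(lambda row: row.value * 2)

# Fast: the expression is executed inside the JVM, no Python round trip.
doubled_df = df.select((F.col("value") * 2).alias("doubled"))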
Edit:
sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load().repartition(number)
This will only take effect after the load has completed, so it won't help you.
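You can confirm this yourself: repartition adds a shuffle after the read instead of changing how the read itself is parallelized. A quick sketch (df and number as above):

df = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load())

# The read still runs with the connector's default parallelism...
print(df.rdd.getNumPartitions())

# ...and only the shuffled result has the requested partition count.
print(df.repartition(number).rdd.getNumPartitions())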
sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source,numPartitions=number).load()
numPartitions is not a valid parameter for the Spark Cassandra Connector, so this won't do anything.
See the read tuning parameters at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters. The input split size determines how many C* partitions to put in a Spark partition.
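So the knob to turn is the connector's input split size, set on the SparkConf before any reads. A minimal sketch; the exact property name (spark.cassandra.input.split.size, counted in C* partitions, versus spark.cassandra.input.split.size_in_mb) depends on your connector version, so check the reference above:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
    # Smaller splits -> more Spark partitions -> more parallel readers.
    .set("spark.cassandra.input.split.size_in_mb", "32"))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Reads through this sqlContext now use the tuned split size.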