
Improve Speed Of Spark App

This is part of my python-spark code, parts of which run too slow for my needs. I would especially like to improve the speed of this part of the code, but I don't know how.

Solution 1:

First, you should figure out what's actually taking the most time.

For example, determine how long just reading the data takes:

axes = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load()
    .count())
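As a rough sketch, you can wrap that in a simple timer; because Spark is lazy, the count() is what actually forces the scan, so the elapsed time covers the full read:

import time

start = time.time()
row_count = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load()
    .count())
# count() triggers the actual read, so this measures the whole scan.
print("Read + count took %.1f seconds for %d rows" % (time.time() - start, row_count))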

Increasing the parallelism or the number of parallel readers may help, but only if you aren't already maxing out the IO of your Cassandra cluster.
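A quick way to see how much read parallelism you are currently getting is to check the partition count of the loaded DataFrame (a minimal sketch; df is just the result of the same load as above):

df = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load())

# Each Spark partition is processed by one task, so this bounds the read parallelism.
print(df.rdd.getNumPartitions())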

Second, see if you can do everything with the DataFrames API. Every time you use a Python lambda you incur serialization costs converting between Python and Scala types.
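To illustrate the difference, here is a sketch (the column name "value" is hypothetical): the first version ships every row out to a Python worker and back, while the second stays entirely in the JVM.

from pyspark.sql import functions as F

df = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="axes", keyspace=source)
    .load())

# Slow: each row is serialized to a Python process, transformed, and serialized back.
doubled_rdd = df.rdd.map(lambda row: row["value"] * 2)

# Fast: expressed with DataFrame functions, executed in the JVM with no Python round trip.
doubled_df = df.select((F.col("value") * 2).alias("value_doubled"))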

Edit:

sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load().repartition(number) 

This will only take effect after the load has completed, so it won't help you.

sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source,numPartitions=number).load()

numPartitions is not a valid parameter for the Spark Cassandra Connector, so this won't do anything.

See https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters. The input split size determines how many Cassandra partitions are put into each Spark partition.
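If you do want more read partitions, a sketch of that knob, assuming a connector version that uses spark.cassandra.input.split.size_in_mb (check the reference page above for your version's exact property name and default), set on the SparkConf before the context is created:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
    .setAppName("axes-read")
    # Smaller split size => more, smaller Spark partitions per table scan.
    .set("spark.cassandra.input.split.size_in_mb", "32"))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)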
