Apache Spark

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark code base was later donated to the Apache Software Foundation, which has maintained it since. Spark is widely used for large-scale data analytics.

Using Apache Spark on RCC Resources

Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. Official examples are available showing how to use Spark in Python, Java, and Scala.

Below is an example Slurm script for submitting a Spark job. The script must be saved with a .sh extension. First, download the example file pi.py.

#!/bin/bash
#SBATCH --nodes 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

# Load the spark module
module load spark

# Start the spark cluster within the job allocation
spark-start.sh
echo $MASTER

# (2 nodes * 3 tasks-per-node * 5 cpus-per-task) = 30 total cores
spark-submit --total-executor-cores 30 --master $MASTER pi.py

The spark module sets up the necessary environment variables, and the spark-start.sh script starts the Spark cluster within the job allocation and exports the master URL in $MASTER.
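The pi.py example referenced above estimates pi by Monte Carlo sampling: random points are drawn in the unit square, and the fraction that land inside the quarter circle approximates pi/4. A minimal sketch of such a script (not necessarily identical to the official pi.py, and assuming the pyspark package provided by the spark module) might look like this:

```python
import random


def inside(_):
    # Draw one random point in the unit square and count it
    # if it falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0


if __name__ == "__main__":
    # The SparkContext connects to the cluster started by spark-start.sh;
    # spark-submit passes the master URL from the --master flag.
    from pyspark import SparkContext

    sc = SparkContext(appName="EstimatePi")
    n = 1_000_000
    # Distribute the n samples across the executors, then sum the hits.
    count = sc.parallelize(range(n)).map(inside).sum()
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()
```

The sampling function is kept pure so the same script can also be sanity-checked locally without a cluster.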

Submit your script using the following command, replacing YOURSCRIPT with the name of your script file:

$ sbatch YOURSCRIPT.sh

For more information about Apache Spark, please refer to the official documentation.