Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark has become a popular tool for large-scale data analytics.
Using Apache Spark on RCC Resources
Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. A number of official examples show how to use Spark in Python, Java, and Scala.
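As a quick illustration (a hypothetical minimal program, not one of the official examples), a PySpark script that builds a small DataFrame and runs a trivial aggregation might look like this:

from pyspark.sql import SparkSession

# Entry point for the DataFrame API; getOrCreate() reuses a session if one exists
spark = SparkSession.builder.appName("example").getOrCreate()

# A tiny in-memory DataFrame with two columns
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
print(df.count())  # prints 3

spark.stop()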
Below is an example Slurm script to submit a Spark job. The script must be saved with the .sh extension. First, download the example file:
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

# Load the spark module
module load spark

# Start the Spark cluster within the job allocation
spark-start.sh

# Print the master URL of the cluster
echo $MASTER

# (2 nodes * 5 cpus-per-task * 3 tasks-per-node) = 30 total cores
spark-submit --total-executor-cores 30 pi.py
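The pi.py referenced above follows the pattern of the official Spark Pi example: a Monte Carlo estimate of pi distributed across the executors. A minimal sketch of such a script (assuming PySpark is importable once the spark module is loaded):

from random import random
from operator import add
from pyspark.sql import SparkSession

# spark-submit supplies the master URL of the cluster started by
# spark-start.sh, so nothing is hard-coded here
spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 30
n = 100000 * partitions

def inside(_):
    # Sample a point in the 2x2 square; count it if it lands in the unit circle
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
             .map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()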
The spark module sets up the necessary environment variables, and the spark-start.sh script starts the Spark cluster within the job allocation; the cluster's master URL is available in the MASTER environment variable, which the script above echoes.
Submit your script using the following command, replacing
YOURSCRIPT with the name of your script file:
$ sbatch YOURSCRIPT.sh
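Once the job is queued, you can monitor it with standard Slurm commands; by default its output, including the echoed master URL, is written to a slurm-<jobid>.out file in the submission directory:

$ squeue -u $USER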
For more information about Apache Spark, please refer to the official documentation.