Apache Spark
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark code-base was later donated to the Apache Software Foundation, which has maintained it since. Spark is becoming a popular tool for data analytics.
Using Apache Spark on RCC Resources
Spark supports the Python (PySpark), R (SparkR), Java, and Scala programming languages. A number of official examples are available showing how to use Spark in Python, Java, and Scala.
Below is an example Slurm script to submit a Spark job. The script must be saved with the .sh extension. First, download the example file, pi.py.
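For reference, the sketch below shows roughly what such a script looks like. It is modeled on the official Spark Pi example (a Monte Carlo estimate of pi) and is only an illustration; the pi.py you download may differ in its details.

import sys
from operator import add
from random import random

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()

    # The number of partitions (and thus parallel tasks) can be passed as an argument
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # Sample a random point in the square [-1, 1] x [-1, 1] and
        # count it if it lands inside the unit circle
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
        .map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()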
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

# Load the spark module
module load spark

# Start the spark cluster within the job allocation
spark-start.sh
echo $MASTER

# Submit the application, using all of the allocated cores:
# (2 nodes * 5 cpus-per-task * 3 tasks-per-node) = 30 total cores
spark-submit --total-executor-cores 30 pi.py
The spark module sets up the necessary environment variables, and the spark-start.sh script starts the Spark cluster within the job allocation.
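If your driver script needs the master URL explicitly (for example, when building the SparkSession yourself), one option is to read the MASTER variable that spark-start.sh exports. The snippet below is only a sketch under that assumption; in many setups spark-submit already points at the correct master and this is unnecessary.

import os

from pyspark.sql import SparkSession

# Fall back to local mode so the script can also be tested outside a job allocation
master_url = os.environ.get("MASTER", "local[*]")

spark = (
    SparkSession.builder
    .appName("ClusterExample")
    .master(master_url)
    .getOrCreate()
)

print("Connected to master:", spark.sparkContext.master)
spark.stop()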
Submit your script using the following command, replacing YOURSCRIPT
with the name of your script file:
$ sbatch YOURSCRIPT.sh
For more information about Apache Spark, please refer to the official documentation.