Apache Spark

Version: 2.2.1

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark is a popular tool for large-scale data analytics.

Spark supports Python (PySpark), R (SparkR), Java, and Scala. Examples for Python, Java, and Scala can be found in the Apache documentation (https://spark.apache.org/examples.html). The following is a sample Slurm script that submits a Spark job; the example application it runs can be found at https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

# Load the spark module
module load spark

# Start the spark cluster
spark-start.sh
echo $MASTER

# (2 nodes * 5 cpus-per-task * 3 tasks-per-node) = 30 total cores
spark-submit --master $MASTER --total-executor-cores 30 pi.py
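
Save the script above to a file (as, say, spark-pi.slurm; the file name here is a placeholder) and submit it to the scheduler with sbatch:

sbatch spark-pi.slurm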

The spark-start.sh script launches the Spark cluster within the Slurm job allocation, and the spark module sets the necessary environment variables, including $MASTER, the URL of the Spark master that spark-submit connects to.
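
For reference, pi.py estimates Pi with a Monte Carlo simulation distributed over the Spark executors: it samples random points in a square and counts how many fall inside the inscribed circle. Below is a condensed sketch of the linked example (the full version also reads the partition count from the command line):

from operator import add
from random import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 30          # one partition per core requested above
n = 100000 * partitions  # total number of random samples

def inside(_):
    # Draw a point in the 2x2 square centered at the origin;
    # it falls inside the unit circle with probability Pi/4.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
             .map(inside) \
             .reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()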