Apache Spark

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark is a popular tool for data analytics.

Spark supports Python (PySpark), R (SparkR), Java, and Scala. Examples for Python, Java, and Scala can be found in the Apache documentation (https://spark.apache.org/examples.html). Following is a sample Slurm script to submit a Spark job. The example code can be found at https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 5

module load spark
spark-start
echo $MASTER
spark-submit --total-executor-cores 30 pi.py

The spark module sets up the necessary environment variables, and the spark-start script launches a standalone Spark cluster within the job allocation. Note that the 30 executor cores requested above match the allocation: 2 nodes x 3 tasks per node x 5 CPUs per task.
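For reference, the pi.py example estimates pi by Monte Carlo sampling: points are drawn uniformly in a square, and pi is inferred from the fraction that lands inside the inscribed circle. A minimal serial sketch of the same idea in plain Python (without Spark, so it can be run anywhere) might look like this; the function name and sample count are illustrative, not part of the Spark example:

```python
import random

def estimate_pi(num_samples, seed=0):
    """Serial Monte Carlo estimate of pi (same idea as Spark's pi.py)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = 0
    for _ in range(num_samples):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Ratio of areas: circle / square = pi / 4.
    return 4.0 * inside / num_samples

print(estimate_pi(1_000_000))
```

In the actual pi.py, the same sampling is distributed across the executors with Spark's parallelize/map/reduce operations, which is what makes the job scale across the allocated nodes.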

Version
2.2.1
Software Category