Hadoop, HPC, and Spear at the Department of Statistics

Casey McLaughlin
March 2014

Try to imagine a spreadsheet with a billion rows and 60,000 columns.

Adrian Barbu at the Florida State University Department of Statistics regularly works with data this large. Dr. Barbu creates algorithms and statistical methodologies for a variety of applications. Repeatedly running his algorithms over such an immense matrix of numbers takes a great deal of computing time.

The algorithms that Dr. Barbu is working on have a wide variety of practical applications, including face detection and recognition, machine learning, and even the evaluation of breast cancer treatments. “I am currently collaborating with Dr. Jinfeng Zhang from our department, who is applying my algorithm to predict response to different chemotherapy treatments based on gene expression levels, age, and other predictors,” states Dr. Barbu.

Some evaluations can be done in a reasonable period of time using standard desktop computers, but many cannot. So Dr. Barbu and his team make use of several resources at the Research Computing Center (RCC). Two of these are the RCC Spear cluster, which is used to run MATLAB jobs in parallel, and the High Performance Computing cluster, which is used to run custom C++ code over many different data sets.

In addition, the RCC and Dr. Barbu are collaborating to evaluate the implementation of Hadoop, open-source software for storing and processing large-scale data sets, at Florida State. “One of my students is evaluating a parallel version of my algorithm on a Hadoop cluster,” says Dr. Barbu. As a result of this collaboration, the RCC aims to create a Hadoop offering for general use at Florida State in the near future.