Adventures in Supercomputing with R

George Ostrouchov (Oak Ridge National Laboratory and University of Tennessee)¹

Charles University, Summer Semester 2022, virtual course through Zoom

Description

This course covers the use of medium to large computing systems with the R language and other software tools for statistical workflows on large data. It includes an overview of hardware and R-related software for such large systems. Statistical topics exercised on these systems include parallel random number generation, bootstrap and crossvalidation, and matrix computation for statistical methods. The class will work on IT4I systems throughout the semester. Concepts will include strategies for fast and efficient R code and for parallel implementations utilizing multicore and multinode approaches.

Objectives

Understanding of parallel hardware and software, and their access from R
Ability to manage multicore and multinode computing from R
Ability to use these technologies for statistical computing and large data processing on large parallel systems
Gaining knowledge when multiple nodes are needed and when faster code or only a few cores are enough

Required Background

Upper level standing in statistics or equivalent (including calculus and matrix algebra)
Intermediate knowledge of statistical computing with R
Other programming experience

Technology Considerations

Many technologies are useful for statistical computing and data science. In this class, we take a narrow and high-level path through these technologies. We learn tools that specifically target working with R on a supercomputer or a generic cluster computer for the purpose of developing code to analyze large data. While the high-level path appears narrow, most other popular technologies are based on the same or similar lower-level concepts that will be discussed.

Supercomputers are unix systems, accessed remotely. Consequently, the first enabler is knowledge of a few unix commands and familiarity with remote access software. This gives access to a uniform platform experience (unix) for the whole class (whether accessing from Windows, Mac, or Linux laptops), except for the initial login access, which can differ.

A second uniformity enabler is the R language along with git version control, which enable a workflow of local code editing and the ability to synchronize with a remote supercomputer. RStudio is an easy way to use both. The git synchronization is also key to software collaborations.

Lectures and exercises (sketch)

One lecture per week, scheduled for 90 minutes each, probably Wednesdays, starting at 17:20. There will be exercises to complete on IT4I systems weekly. I plan another scheduled hour to answer questions regarding the exercises. This can be adjusted as the lectures proceed.

The order of concepts and exercises are subject to change as the lectures proceed. I hope to make the lectures as interactive as possible. I would like to work one or two large data sets in a way that intersects with many lectures and exercises. Some potential data sets are listed at the bottom and I welcome other suggestions.

Introductions, expectation, IT4I accounts, workflow (laptop via git to cluster)
1.1. Exercise: IT4I accounts setup, simple unix, ssh concepts
Overview of parallel hardware
2.1 Exercise: Working over ssh with a single node
Overview of parallel software, interactive vs. batch, scaling concepts
3.1 Exercise: Running interactive or batch, PBS scheduler at IT4I (comment on SLURM)
Timing, benchmarking, profiling R code, and git version control overview
3.1 Exercise: Benchmark R code on your laptop and on a cluster node. Code on laptop, get via git to cluster.
Speeding up your serial R code, converting sections to C/C++
4.1 Exercise: Benchmark code examples
Using multicore parallelism, unix fork, multithreaded BLAS, hyperthreading
1.1 Exercise: PBS (SLURM) Managing multicore and multinode work in R
MPI, Regression case study: parallelizing random forest multicore and multinode 8.1 Exercise: Generate timings and make a scaling graphs of code speedup
Reading data in parallel, CSV, HDF5, ADIOS2 7.1 Exercise: Work with a large data set
Parallel matrix computation via OpenBLAS and ScaLAPACK libraries from R
10.1 Exercise:
Distributed PCA case study: Parallel data ingestion to randomized PCA
9.1 Exercise:
Projects and selected further concepts
11.1 Exercise:
Projects and selected further concepts 12.1 Exercise:

Lectures and exercises will draw from these topics

Computing Concepts:
- Batch and Interactive Computing
- Version control for software development
  - GitLab, GitHub
- Speeding up serial code
  - Vectorize, Compute once (trade memory for computation), use special functions (e.g. rowSum, colSum)
  - Write C/C++ code or CUDA code
- Parallel hardware architectures
  - Cores and Manycores (FPGAs?), Nodes, GPUs, hyperthreading
- Parallel software concepts
  - Amdahl’s Law, strong scaling, weak scaling, time to solution
  - Shared Memory (+thread safety), Distributed Memory (+communication), and Memory hierarchies
  - Libraries (MKL, OpenBLAS, nvBLAS, etc.)
  - Communication (MPI & SPMD, MapReduce, Dataflow)
  - Reading data in parallel
Statistical Methods:
- Random number generation (serial and parallel)
- Bootstrap and crossvalidation
- Matrix computation (Regression)
- Randomized algorithms (rsvd)?
Tools:
- R, RStudio, and packages
- Linux and other unix flavors (just basics)
  - text editors (emacs, vim, ?)
  - file copy, move, permissions, groups, less, cat, pipes, redirection, grep
  - ssh authentication and tunelling
- git code management (plain and via RStudio)
- PBS or SLURM job scheduling on clusters
Parallel R Tools
- parallel package
- pbdR packages on GitHub
Data is to be determined. Possibilities:
- Yellow taxi NYC data (100+ GB in ~200 MB monthly chunks)
- Czech data
- Geoportal Praha
- I welcome suggestions of other large data!

Syllabus text produced from a .Rmd file and rendered to html via Knit in RStudio↩︎