2021/2022, summer semester:
Parallel matrix computations

Charles University, Prague.




The goal of this course is to introduce parallel processing of the basic computational kernels encountered in mathematical modeling and, more generally, in scientific computing. These kernels include, for example, basic parallel operations with dense and sparse matrices, parallelism in sparse direct methods, and parallel preconditioning of Krylov subspace methods. The course also includes an elementary introduction to multigrid (not yet covered in the text) and domain decomposition methods.

Basic course text (2021/2022)


Lectures are based on the extended syllabus Parallel Matrix Computations (Gentle intro into HPC) [pdf]. This text may be slightly modified during the course. Part of the course may be run online; slides prepared for this purpose, mentioned below, may be used during some of the lectures. The slides, too, will very probably be updated as the course proceeds.

Lectures 2021/2022 - updated as they go


  • 15.02.2022: Introduction.
            -- The slides for the whole course will be stored at [ppslides.pdf]. They will be updated (extended) during the course
  • 22.02.2022: Parallel computing. Classification of parallel architectures. Hardware and software components. Uniprocessor model
           
  • 1.3.2022: Interconnection network. Static and dynamic interconnects. Routing and switching. Time models: speedup, efficiency, Amdahl's law (a small numeric sketch follows this list)
            -- Strong and weak scalability. Uniprocessor model: hiding latency by cache, multithreading, prefetch.
  • 1.3.2022: Vector processing model. Examples of good and bad vectorization.
            --
  • 8.3.2022: Multiprocessor model. Important additional items: granularity, decomposition and load balancing.
            --
  • 15.3.2022: Multiprocessor model. Parallelizing (and vectorizing) some code constructs. In particular, some iterative methods with fine granularity, linear recurrences. FFT.
            --
  • 22.3.2022: Monte Carlo method - parallelization, measuring communication and computation - "hypercube assumption"
            -- dense matrix-vector multiplication, dense matrix-matrix multiplication
  • 29.3.2022: Parallelizing recurrences
  • 5.4.2022: Dense matrix-matrix multiplication: Cannon algorithm, SUMMA, parallel Gaussian elimination (standard and pipelined)
            -- extreme parallelism to solve linear algebraic systems
  • 12.4.2022: Graph partitioning (continuation, started 5.4.)
  • 19.4.2022: Parallel linear algebraic solvers.
            --
  • 26.4.2022: Parallel linear algebraic solvers (continued).
            --
  • 3.5.2022: Parallel linear algebraic solvers (continued).
            --
  • 10.5.2022: Parallel preconditioning.
            --
  • 17.5.2022: Repetition, summary, exams.
            --
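
  A small numeric sketch of Amdahl's law mentioned in the lectures above. This is a minimal illustration in C; the parallel fraction f = 0.95 is a made-up value, not taken from the course text:

    #include <stdio.h>

    /* Amdahl's law: a program whose fraction f is perfectly parallelizable
       and runs on p processors has speedup S(p) = 1 / ((1 - f) + f / p). */
    static double amdahl(double f, int p) {
        return 1.0 / ((1.0 - f) + f / (double)p);
    }

    int main(void) {
        double f = 0.95;   /* hypothetical parallel fraction */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d  speedup = %6.2f  efficiency = %5.2f\n",
                   p, amdahl(f, p), amdahl(f, p) / p);
        /* as p grows, the speedup stays bounded by 1/(1-f) = 20 */
        return 0;
    }

  Efficiency S(p)/p decays toward zero for a fixed problem size, which is exactly the motivation for weak scalability (growing the problem with p).
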
  • Examples of questions for the exam


  • Uniprocessor model of computation and the ways to hide latency. Locality. Regularity. (slides pages approx. 74-87, terminology from 1-73)
  • Vectorized computations (slides pages approx. 88-128, terminology)
  • Matrix-matrix multiplication versus cache size (three ways to do the computation) (slides pages approx. 162-200, terminology; a blocked sketch follows this list)
  • Matrix-vector multiplication and its cost efficiency (slides pages approx. 163-182, terminology)
  • Matrix-matrix multiplication and its cost efficiency (slides pages approx. 183-200, terminology)
  • Cannon algorithm and SUMMA. (slides pages approx. 186-200, terminology)
  • Graph partitioning (slides pages approx. 276-326, terminology)
  • Parallel aspects of the linear algebraic solvers (factorization, parallelizable components of iterative solvers, synchronization points). (slides pages approx. 327-375, terminology)
  • Parallel preconditioning - ideas. (slides pages approx. 378 onward, terminology)
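
  For the question on matrix-matrix multiplication versus cache size, here is a minimal sketch of the cache-blocked variant in C. The sizes N and BS are hypothetical, with BS chosen so that three BS x BS blocks of doubles (about 24 KiB) fit in cache:

    #include <stdio.h>
    #include <stdlib.h>

    enum { N = 256, BS = 32 };  /* hypothetical: 3 * BS*BS doubles = 24 KiB */

    /* Blocked C += A*B: the working set per block triple is three BSxBS
       blocks instead of three full matrices, so every value loaded into
       cache is reused BS times before being evicted. */
    static void matmul_blocked(const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    /* multiply block (ii,kk) of A by block (kk,jj) of B */
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double aik = A[i*N + k];
                            for (int j = jj; j < jj + BS; j++)
                                C[i*N + j] += aik * B[k*N + j];
                        }
    }

    int main(void)
    {
        double *A = calloc(N * N, sizeof *A);
        double *B = calloc(N * N, sizeof *B);
        double *C = calloc(N * N, sizeof *C);
        for (int i = 0; i < N; i++) {          /* A = I, B = all ones */
            A[i*N + i] = 1.0;
            for (int j = 0; j < N; j++) B[i*N + j] = 1.0;
        }
        matmul_blocked(A, B, C);
        printf("C[0][0] = %g (expect 1, since I*B = B)\n", C[0]);
        free(A); free(B); free(C);
        return 0;
    }

  With a very small cache the same loop nest degenerates to the unblocked triple loop (BS = 1 reuses nothing); a cache of reasonable size allows a large BS, which is the regime BLAS3 kernels are written for.
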

    Schedule from the previous year


  • Introduction into parallel computation and HPC.
  • Uniprocessor model.
  • Vectorization model.
  • Vectorization model II.
  • SELF-STUDY: starting at page 50
            -- LAPACK and LU - blocks
            -- LAPACK and QR - showing that QR can also be formulated with blocks
            -- an example of how the amount of available cache influences the choice of algorithm - here matrix-matrix multiplication: three cases
                -- only a very small cache
                -- slightly larger cache
                -- reasonable cache
            -- note that prevailing BLAS3 processing is based on a cache of reasonable size
            -- Multiprocessor model:
                -- shared/distributed memory
                -- blocking, communication
                -- granularity
                    -- contemporary computers (as we have discussed) can process even very fine-grained tasks
                    -- still, coarse grain is often preferable
                -- problem decomposition
                    -- various classifications
                    -- processing should be balanced
                    -- most of the later examples use static load balancing
                    -- cyclic ordering to get better balancing on "nonsymmetric processes"
                    -- -- matrix x matrix is "symmetric"
                    -- -- Gaussian elimination is "nonsymmetric"
                -- standardizations: communication libraries, linear algebra libraries
                -- -- BLACS, PBLAS, ScaLAPACK, PLASMA, MAGMA
                -- programming patterns
  • Logarithmic troubles: Recurrences. Recursive doubling. Cyclic reduction. FFT.
            -- Pointwise Jacobi, Gauss-Seidel
            -- Branches
            -- Recurrences
            -- Recursive doubling (a small sketch follows this schedule)
            -- Cyclic reduction
            -- Parallel prefix
            -- Horner's scheme and Estrin's improvement
            -- FFT
            -- Monte Carlo
  • Starting at page 79: Dense matrix-vector operations
            -- Measuring computation and communication: there is an assumption on the communication that enables some quantitative estimates.
            -- AXPY
            -- Dot product
            -- Rowwise 1D partitioning for Matvec (an MPI sketch follows this schedule)
            -- Block rowwise 1D partitioning for Matvec
            -- 2D partitioning for Matvec
            -- Block 2D partitioning for Matvec
            -- Matmats
            -- Cannon algorithm
            -- SUMMA
            -- Gaussian elimination
            -- Extreme parallelism
            -- Strassen's approach
  • Graph partitioning.
  • Sparse solvers of linear algebraic equations in parallel environment.
            -- from now on we use https://cesnet.zoom.us/j/906206832?pwd=U2FDSWluVVlERWIxTm81MWFoRks2UT09
  • Standard and Parallel preconditioning.
            -- Text for this part is in Czech: chapter 15 of [pdf]
  • Standard and Parallel preconditioning II.
            -- Text for this part is in Czech: chapter 15 of [pdf]
  • Standard and Parallel preconditioning III.
            -- Text for this part is in Czech: chapter 15 of [pdf]
  • Standard and Parallel preconditioning IV.
            -- Text for this part is in Czech: chapter 15 of [pdf]
            -- Finally finalized :-): chapter 15 of [pdf]
            -- Starting a simple introduction to domain decomposition
  • Domain decomposition.
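
  A small sketch of recursive doubling from the schedule above, here for prefix sums: log2(N) steps, each of which would be one fully parallel update on a parallel machine. The descending inner loop merely emulates that synchronous step on a single processor:

    #include <stdio.h>

    enum { N = 8 };

    int main(void)
    {
        double x[N];
        for (int i = 0; i < N; i++) x[i] = 1.0;  /* prefix sums will be 1..8 */

        /* Recursive doubling: in the step with offset d every element adds
           the value d positions to its left; after all log2(N) steps x[i]
           holds the sum of the original x[0..i].  All i could run at once;
           descending i keeps the emulation correct on one processor. */
        for (int d = 1; d < N; d *= 2)
            for (int i = N - 1; i >= d; i--)
                x[i] += x[i - d];

        for (int i = 0; i < N; i++) printf("%g ", x[i]);  /* 1 2 3 ... 8 */
        printf("\n");
        return 0;
    }
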
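  And a sketch of the rowwise 1D partitioning for Matvec in MPI: each rank owns n/p consecutive rows of A and the matching slice of x, the full x is assembled with MPI_Allgather, and the multiplication is then purely local. The matrix contents (identity) and the size n = 8p are made up only so the run is checkable:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 8 * p, nloc = n / p;  /* hypothetical, n divisible by p */
        double *Aloc = malloc((size_t)nloc * n * sizeof *Aloc); /* my rows  */
        double *xloc = malloc((size_t)nloc * sizeof *xloc);     /* my slice */
        double *x    = malloc((size_t)n * sizeof *x);           /* full x   */
        double *yloc = malloc((size_t)nloc * sizeof *yloc);     /* my y     */

        for (int i = 0; i < nloc; i++) {    /* A = identity, x = all ones */
            xloc[i] = 1.0;
            for (int j = 0; j < n; j++)
                Aloc[i*n + j] = (rank*nloc + i == j) ? 1.0 : 0.0;
        }

        /* the only communication: everyone assembles the full vector x */
        MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE,
                      MPI_COMM_WORLD);

        for (int i = 0; i < nloc; i++) {    /* local rows times full x */
            double s = 0.0;
            for (int j = 0; j < n; j++) s += Aloc[i*n + j] * x[j];
            yloc[i] = s;
        }

        if (rank == 0) printf("y[0] = %g (expect 1, since A = I)\n", yloc[0]);
        free(Aloc); free(xloc); free(x); free(yloc);
        MPI_Finalize();
        return 0;
    }
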


    Additional resources (2021/2022)


  • W.L. Briggs, V.E. Henson and S.F. McCormick: A Multigrid Tutorial, 2nd edition, SIAM, Philadelphia, PA, 2000.
  • T.A. Davis: Direct Methods for Sparse Linear Systems, SIAM, Philadelphia, PA, 2006.
  • J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst: Numerical Linear Algebra for High-Performance Computers, SIAM, Philadelphia, PA, 1998.
  • A. Grama, A. Gupta, G. Karypis and V. Kumar: Introduction to Parallel Computing, 2nd edition, Addison-Wesley, 2003.
  • M.A. Olshanskii and E.E. Tyrtyshnikov: Iterative Methods for Linear Systems: Theory and Applications, SIAM, Philadelphia, PA, 2014.
  • T. Rauber and G. Rünger: Parallel Programming for Multicore and Cluster Systems, 2nd edition, Springer, Berlin-Heidelberg, 2012.
  • A. Toselli and O. Widlund: Domain Decomposition Methods - Algorithms and Theory, Springer, 2004.