2021/2022, summer semester:
Parallel matrix computations
Charles University, Prague.
The goal of this course is to introduce parallel processing of the basic computational kernels
encountered in mathematical modeling as well as in the general context of scientific computing.
These kernels include, for example, basic parallel operations with dense and sparse matrices, parallelism in
sparse direct methods, and parallel
preconditioning of Krylov subspace methods.
The course also includes an elementary introduction to multigrid (not in the text at this moment)
and domain decomposition methods.
Basic course text (2021/2022)
Lectures are based on the extended syllabus Parallel Matrix Computations (A Gentle Introduction to HPC)
[pdf]. This text may be slightly modified during the course. The course may run
partially online, and slides have been prepared for that purpose, as mentioned below;
they may be used during part of the lectures. The slides will also very probably be updated
as the course proceeds.
Lectures 2021/2022 - updated as they go
15.02.2022: Introduction.
        -- The slides for the whole course will be stored at [ppslides.pdf] and will be extended as the course progresses.
22.02.2022: Parallel computing. Classification of parallel architectures. Hardware and software components. Uniprocessor model.
1.3.2022: Interconnection networks. Static and dynamic interconnects. Routing and switching. Time models: speedup, efficiency, Amdahl's law.
        -- Strong and weak scalability. Uniprocessor model: hiding latency by caches, multithreading, prefetching.
        -- Vector processing model. Examples of good and bad vectorization.
        --
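The speedup, efficiency, and Amdahl's law notions above can be checked numerically; a minimal sketch (the 90% parallel fraction below is an arbitrary example, not a figure from the course):

```python
# Amdahl's law: with parallel fraction f and p processors,
# speedup S(p) = 1 / ((1 - f) + f / p) and efficiency E(p) = S(p) / p.

def amdahl_speedup(f, p):
    """Predicted speedup for parallel fraction f on p processors."""
    return 1.0 / ((1.0 - f) + f / p)

def efficiency(f, p):
    """Efficiency = speedup divided by processor count."""
    return amdahl_speedup(f, p) / p

# Even with 90% of the work parallelizable, speedup saturates:
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(0.9, p), 2), round(efficiency(0.9, p), 3))
# The serial 10% caps the speedup at 1 / (1 - 0.9) = 10, no matter how large p is.
```

The saturation is the point of the law: efficiency decays toward zero for fixed problem size, which is why weak scalability (growing the problem with p) is discussed alongside it.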
8.3.2022: Multiprocessor model. Important additional items: granularity, decomposition and load balancing.
        --
15.3.2022: Multiprocessor model. Parallelizing (and vectorizing) some code constructs. In particular, some iterative methods
with fine granularity, linear recurrences. FFT.
        --
22.3.2022: Monte Carlo method - parallelization, measuring communication and computation - "hypercube assumption"
        -- dense matrix-vector multiplication, dense matrix-matrix multiplication
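The kind of quantitative estimate this lecture is about can be sketched with the usual message-passing cost model, where a message of m words costs ts + tw*m; the function names and the parameter values below are illustrative assumptions, not taken from the course slides:

```python
import math

# Illustrative cost model ("hypercube assumption" style): a message of m
# words costs ts + tw*m, and the all-gather that gives every process the
# full input vector costs about ts*log2(p) + tw*n on a hypercube.

def matvec_time_1d(n, p, ts=100.0, tw=1.0, tc=1.0):
    """Estimated time of a rowwise 1D-partitioned n x n matvec on p processes."""
    compute = tc * n * n / p            # each process multiplies n/p rows
    comm = ts * math.log2(p) + tw * n   # all-gather of the input vector
    return compute + comm

def speedup(n, p, ts=100.0, tw=1.0, tc=1.0):
    """Speedup over the tc*n^2 serial matvec."""
    return (tc * n * n) / matvec_time_1d(n, p, ts, tw, tc)

# For fixed n, communication eventually dominates and adding
# processes stops helping:
for p in (4, 16, 64, 256):
    print(p, round(speedup(1024, p), 1))
```

The same bookkeeping, with different communication terms, yields the cost estimates for the dense matrix-matrix algorithms later in the course.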
29.3.2022: Parallelizing recurrences
5.4.2022: Dense matrix-matrix multiplication: Cannon's algorithm, SUMMA; parallel Gaussian elimination (standard and pipelined)
        -- extreme parallelism to solve linear algebraic systems
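Why the distribution of rows matters for parallel Gaussian elimination can be seen from a crude load model: the update work attached to a row grows as elimination proceeds, so contiguous (block) ownership is imbalanced while cyclic ownership is nearly even. A sketch, under the assumed simplification that row i accumulates about i units of work over the whole elimination:

```python
# Crude, assumed work model: in Gaussian elimination row i is updated in
# roughly i elimination steps, so its accumulated work grows with i.

def per_process_work(n, p, owner):
    """Sum the modeled per-row work over the process owning each row."""
    work = [0] * p
    for i in range(n):
        work[owner(i, n, p)] += i
    return work

block  = lambda i, n, p: i // (n // p)  # contiguous chunks of rows
cyclic = lambda i, n, p: i % p          # row i -> process i mod p

n, p = 1024, 8
wb = per_process_work(n, p, block)
wc = per_process_work(n, p, cyclic)
print("block  max/min work ratio:", round(max(wb) / min(wb), 2))
print("cyclic max/min work ratio:", round(max(wc) / min(wc), 4))
```

This is the "nonsymmetric process" situation mentioned in the self-study notes below: matrix-matrix multiplication loads all rows equally, but elimination does not, so cyclic (or block-cyclic) orderings are used to restore balance.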
12.4.2022: Graph partitioning (continuation, started 5.4.)
19.4.2022: Parallel linear algebraic solvers.
        --
26.4.2022: Parallel linear algebraic solvers (continued).
        --
3.5.2022: Parallel linear algebraic solvers (continued).
        --
10.5.2022: Parallel preconditioning.
        --
17.5.2022: Review, summary, exams.
        --
Examples of questions for the exam
Uniprocessor model of computation and ways to hide latency. Locality. Regularity. (slides, approx. pages 74-87; terminology from pages 1-73)
Vectorized computations (slides, approx. pages 88-128; terminology)
Matrix-matrix multiplication versus cache size (three ways to do the computation) (slides, approx. pages 162-200; terminology)
Matrix-vector multiplication and its cost efficiency (slides, approx. pages 163-182; terminology)
Matrix-matrix multiplication and its cost efficiency (slides, approx. pages 183-200; terminology)
Cannon's algorithm and SUMMA (slides, approx. pages 186-200; terminology)
Graph partitioning (slides, approx. pages 276-326; terminology)
Parallel aspects of linear algebraic solvers (factorization, parallelizable components of iterative solvers,
synchronization points) (slides, approx. pages 327-375; terminology)
Parallel preconditioning - ideas (slides, approx. pages from 378; terminology)
Schedule from the previous year
Introduction into parallel computation and HPC.
Uniprocessor model.
Vectorization model.
Vectorization model II.
SELF-STUDY: starting at page 50
        -- LAPACK and LU - blocks
        -- LAPACK and QR - showing that QR can also be formulated with blocks
        -- an example of how the amount of available cache affects the choice
of algorithm - here matrix-matrix multiplication, with three cases
            -- only a very small cache
            -- slightly larger cache
            -- reasonable cache
        -- note that prevailing BLAS3 processing is based on a cache of reasonable size
        -- Multiprocessor model:
            -- shared/distributed memory
            -- blocking, communication
            -- granularity
                -- contemporary computers (as we have discussed) can process even very fine-grain tasks
                -- still, coarse grain is often preferable
            -- problem decomposition
                -- various classifications
                -- processing should be balanced
                -- most of the later examples use static load balancing
                -- cyclic ordering to get better balancing on "nonsymmetric processes"
                -- -- matrix x matrix is "symmetric"
                -- -- Gaussian elimination is "nonsymmetric"
            -- standardizations: communication libraries, linear algebra libraries
            -- -- BLACS, PBLAS, ScaLAPACK, PLASMA, MAGMA
            -- programming patterns
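The cache-size discussion in the self-study notes above (very small vs. reasonable cache) is what blocked (tiled) matrix-matrix multiplication exploits. A minimal serial sketch, where the block size b stands for "what fits in cache" (interpreted Python will not show the actual cache effect; the point is the loop structure):

```python
def matmul_naive(A, B):
    """Plain triple loop: streams whole rows and columns, so for large n
    every operand is fetched from memory again and again."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_blocked(A, B, b):
    """Same product computed block by block; with b chosen so that three
    b x b blocks fit in cache, each loaded block is reused about b times."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # multiply one pair of cached blocks into a block of C
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

The note above that BLAS3 assumes a cache of reasonable size is exactly this: a BLAS3 kernel such as dgemm is organized around block reuse.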
Logarithmic troubles: Recurrences. Recursive doubling. Cyclic reduction. FFT.
        -- Pointwise Jacobi, Gauss-Seidel
        -- Branches
        -- Recurrences
        -- Recursive doubling
        -- Cyclic reduction
        -- Parallel prefix
        -- Horner's scheme and Estrin's improvement
        -- FFT
        -- Monte Carlo
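The recursive doubling behind several of these items (recurrences, parallel prefix) can be sketched serially: each doubling step performs only independent updates, so an n-element scan needs O(log n) parallel steps instead of the n-step sequential recurrence. A minimal inclusive prefix-sum sketch:

```python
def parallel_prefix(xs):
    """Inclusive prefix sums by recursive doubling (Hillis-Steele style):
    after the step with offset d, position i holds the sum of the last
    min(i + 1, 2 * d) inputs, so log2(n) doubling steps suffice."""
    out = list(xs)
    d = 1
    while d < len(out):
        # every position i >= d updates independently: one parallel step
        out = [out[i] + (out[i - d] if i >= d else 0) for i in range(len(out))]
        d *= 2
    return out

print(parallel_prefix([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

Replacing + with any associative operation gives the general parallel prefix; the same doubling idea underlies the treatment of linear recurrences and cyclic reduction in this block.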
Starting at page 79: Dense matrix-vector operations
        -- Measuring computation and communication: an assumption on the communication cost enables some quantitative estimates.
        -- AXPY
        -- Dot product
        -- Rowwise 1D partitioning for Matvec
        -- Block rowwise 1D partitioning for Matvec
        -- 2D partitioning for Matvec
        -- Block 2D partitioning for Matvec
        -- Matmats
        -- Cannon's algorithm
        -- SUMMA
        -- Gaussian elimination
        -- Extreme parallelism
        -- Strassen's approach
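Cannon's algorithm from this list can be simulated serially on a q x q block grid: after an initial skew of A's block-rows and B's block-columns, each of q steps multiplies the local blocks and circularly shifts A left and B up. A sketch under the assumption that q divides n (the helper names are mine, not from the slides; the shifts stand in for the nearest-neighbor messages of the distributed version):

```python
def split(M, q):
    """Cut an n x n matrix (list of lists) into a q x q grid of blocks."""
    n = len(M); b = n // q
    return [[[row[J * b:(J + 1) * b] for row in M[I * b:(I + 1) * b]]
             for J in range(q)] for I in range(q)]

def join(blocks):
    """Reassemble a q x q grid of blocks into one matrix."""
    q = len(blocks)
    rows = []
    for I in range(q):
        for r in range(len(blocks[I][0])):
            rows.append(sum((blocks[I][J][r] for J in range(q)), []))
    return rows

def block_mul_add(C, A, B):
    """C += A * B for small square blocks."""
    b = len(A)
    for i in range(b):
        for k in range(b):
            aik = A[i][k]
            for j in range(b):
                C[i][j] += aik * B[k][j]

def cannon(A, B, q):
    n = len(A); b = n // q
    Ab, Bb = split(A, q), split(B, q)
    # initial alignment: shift block-row I of A left by I,
    # block-column J of B up by J
    Ab = [[Ab[I][(J + I) % q] for J in range(q)] for I in range(q)]
    Bb = [[Bb[(I + J) % q][J] for J in range(q)] for I in range(q)]
    Cb = [[[[0.0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):                  # q multiply-and-shift steps
        for I in range(q):
            for J in range(q):
                block_mul_add(Cb[I][J], Ab[I][J], Bb[I][J])
        Ab = [[Ab[I][(J + 1) % q] for J in range(q)] for I in range(q)]  # A left
        Bb = [[Bb[(I + 1) % q][J] for J in range(q)] for I in range(q)]  # B up
    return join(Cb)
```

After the alignment, process (I, J) multiplies A-block (I, (I+J+s) mod q) by B-block ((I+J+s) mod q, J) at step s, so over q steps it accumulates the full block sum; each process needs memory for only three blocks.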
Graph partitioning.
Sparse solvers of linear algebraic equations in parallel environment.
        -- from now on we use https://cesnet.zoom.us/j/906206832?pwd=U2FDSWluVVlERWIxTm81MWFoRks2UT09
Standard and Parallel preconditioning.
        -- Text for this part is in Czech: chapter 15 of [pdf]
Standard and Parallel preconditioning II.
        -- Text for this part is in Czech: chapter 15 of [pdf]
Standard and Parallel preconditioning III.
        -- Text for this part is in Czech: chapter 15 of [pdf]
Standard and Parallel preconditioning IV.
        -- Text for this part is in Czech: chapter 15 of [pdf]
        -- Finally finalized :-): chapter 15 of [pdf]
        -- Starting simple intro into the domain decomposition
Domain decomposition.
Additional resources (2021/2022)
W.L. Briggs, Van Emden Henson and S.F. McCormick: A Multigrid Tutorial, 2nd edition, SIAM, Philadelphia, PA, 2000.
T.A. Davis: Direct Methods for Sparse Linear Systems, SIAM, Philadelphia, PA, 2006.
J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst: Numerical Linear Algebra for High-Performance Computers, SIAM, Philadelphia, PA, 1998.
A. Grama, A. Gupta, G. Karypis and V. Kumar: Introduction to Parallel Computing, 2nd edition, Addison-Wesley, 2003.
M.A. Olshanskii and E.E. Tyrtyshnikov: Iterative Methods for Linear Systems: Theory and Applications, SIAM, Philadelphia, PA, 2014.
T. Rauber and G. Rünger: Parallel Programming for Multicore and Cluster Systems, 2nd edition, Springer, Berlin-Heidelberg, 2012.
A. Toselli and O. Widlund: Domain Decomposition Methods - Algorithms and Theory, Springer, 2004.