8.5 Parallel Matrix Multiplication
Matrix-matrix multiplication is a cornerstone of scientific and engineering computation, forming the computational core of applications in fields from linear algebra and physics simulations to machine learning. Its high computational intensity, with $O(n^3)$ arithmetic operations on $O(n^2)$ data for $n \times n$ matrices, makes it an excellent candidate for parallelization and a standard benchmark for high-performance computing (HPC) systems.1 This case study examines two classic algorithms for distributed-memory parallel matrix multiplication, Cannon’s algorithm and SUMMA, which employ distinct data decomposition and communication patterns to distribute the workload across a 2D grid of processors.
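A quick back-of-the-envelope accounting makes the "high computational intensity" claim concrete: counting one multiply and one add per term of each inner product,

$$
\text{flops} = 2n^3, \qquad \text{data} = 3n^2 \ \text{matrix elements } (A, B, C), \qquad \text{arithmetic intensity} = \frac{2n^3}{3n^2} = O(n),
$$

so the amount of computation per word of data grows with the problem size, which is exactly the property that lets communication costs be amortized on parallel machines.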
Cannon’s Algorithm
Cannon’s algorithm is a memory-efficient parallel algorithm for multiplying two matrices on a 2D processor grid, typically a square grid of processors with a torus (wraparound) network topology.2 It orchestrates the movement of data blocks in a highly structured, nearest-neighbor fashion to ensure each processor computes its share of the result matrix.
Methodology
The algorithm for computing $C = A \times B$ on a $\sqrt{p} \times \sqrt{p}$ processor grid, where processor $P(i,j)$ is responsible for computing the sub-block $C(i,j)$, proceeds in two main phases:2
- Initial Alignment: Before the main computation begins, the input matrices A and B are pre-aligned.
  - For each row $i$, the block $A(i,j)$ is circularly shifted left by $i$ positions.
  - For each column $j$, the block $B(i,j)$ is circularly shifted up by $j$ positions. This initial skew ensures that after alignment, processor $P(i,j)$ holds blocks $A(i, (i+j) \bmod \sqrt{p})$ and $B((i+j) \bmod \sqrt{p}, j)$. These are the blocks needed to compute the first term in the sum for $C(i,j)$.
- Compute-and-Shift Loop: The algorithm then enters a main loop that iterates $\sqrt{p}$ times. In each step of the loop:
  - Compute: Each processor multiplies its current local blocks of A and B and adds the result to its local block of C.
  - Shift: Each processor shifts its block of A one position to the left (circularly) and its block of B one position up (circularly), receiving new blocks from its right and bottom neighbors, respectively.

After $\sqrt{p}$ steps, each block of A will have passed through each processor in its row, and each block of B will have passed through each processor in its column, ensuring that all necessary products for each $C(i,j)$ have been computed. The algorithm is memory-efficient because each processor only needs to store one block of A, one of B, and one of C at any given time.1
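The correctness of this schedule can be summarized by a simple invariant: after the initial skew and $s$ subsequent shifts, processor $P(i,j)$ holds the pair of blocks indexed by $k = (i + j + s) \bmod \sqrt{p}$, so over the $\sqrt{p}$ steps it accumulates every term of its block inner product:

$$
C(i,j) = \sum_{s=0}^{\sqrt{p}-1} A\bigl(i,\ (i+j+s) \bmod \sqrt{p}\bigr)\, B\bigl((i+j+s) \bmod \sqrt{p},\ j\bigr) = \sum_{k=0}^{\sqrt{p}-1} A(i,k)\, B(k,j).
$$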

Credit: https://www.cs.fsu.edu/~xyuan/cop5570/lecture17_files/image018.gif
Algorithm: CannonsAlgorithm(A[$n \times n$], B[$n \times n$], $\sqrt{p} \times \sqrt{p}$ processor grid)
Input: Matrices A and B of size $n \times n$, $p = \sqrt{p} \times \sqrt{p}$ processors
Output: Matrix $C = A \times B$

// Assume n is divisible by $\sqrt{p}$
// Processor $P(i,j)$ initially holds blocks $A(i,j)$ and $B(i,j)$
block_size = $n / \sqrt{p}$

// Phase 1: Initial Alignment (Pre-skewing)
parallel for each processor $P(i, j)$ do
  // Shift A blocks: row i shifted left by i positions (circular)
  A_local = receive_from($P(i, (j+i) \bmod \sqrt{p})$)
  // Shift B blocks: column j shifted up by j positions (circular)
  B_local = receive_from($P((i+j) \bmod \sqrt{p}, j)$)
  // Initialize result block
  C_local = 0
end parallel

// Phase 2: Compute-and-Shift Loop ($\sqrt{p}$ iterations)
for step = 0 to $\sqrt{p} - 1$ do
  parallel for each processor $P(i, j)$ do
    // Compute: multiply local blocks and accumulate
    C_local = C_local + (A_local $\times$ B_local)
    // Shift A blocks: send left, receive from right
    send(A_local, to_processor=$P(i, (j-1) \bmod \sqrt{p})$)
    A_local = receive_from($P(i, (j+1) \bmod \sqrt{p})$)
    // Shift B blocks: send up, receive from below
    send(B_local, to_processor=$P((i-1) \bmod \sqrt{p}, j)$)
    B_local = receive_from($P((i+1) \bmod \sqrt{p}, j)$)
  end parallel
end

// Each processor now has its final $C(i,j)$ block
return C assembled from all C_local blocks

Time Complexity:
- Computation: $\sqrt{p}$ steps $\times$ $O((n/\sqrt{p})^3)$ per step = $O(n^3 / p)$
- Communication: $O(\sqrt{p})$ nearest-neighbor shifts, each moving a block of $O(n^2 / p)$ elements
- Memory per processor: $O(n^2 / p)$
Communication pattern: Torus/wraparound grid with nearest-neighbor only
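The following is a minimal, single-process NumPy sketch that simulates Cannon's algorithm by keeping the $\sqrt{p} \times \sqrt{p}$ block grid as a nested list and implementing the circular shifts with index arithmetic; the matrix size and grid dimension are illustrative, and in a real distributed implementation each block would live on its own MPI rank with send/receive calls replacing the in-memory shuffles.

```python
import numpy as np

def cannon_matmul(A, B, q):
    """Simulate Cannon's algorithm on a q x q block grid (single process)."""
    n = A.shape[0]
    assert n % q == 0, "n must be divisible by the grid dimension q"
    b = n // q  # block size

    # Split A and B into q x q grids of b x b blocks; C starts at zero.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]

    # Phase 1: initial alignment.
    # Row i of A shifts left by i positions; column j of B shifts up by j positions.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

    # Phase 2: q compute-and-shift steps.
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Shift every A block one position left, every B block one position up.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble C from its blocks.
    return np.block(Cb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((12, 12))
    B = rng.standard_normal((12, 12))
    assert np.allclose(cannon_matmul(A, B, q=3), A @ B)
```

The final assertion checks the simulated result against a direct product, which is a convenient way to sanity-check the shift pattern before porting it to message passing.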
SUMMA (Scalable Universal Matrix Multiplication Algorithm)
SUMMA is a more flexible and widely used algorithm that is the basis for matrix multiplication in libraries like ScaLAPACK. Instead of nearest-neighbor shifts, it uses broadcast communication and formulates the multiplication as a sequence of rank-k updates.3
Methodology
SUMMA also operates on a 2D processor grid, but it does not require the grid to be square. The algorithm for $C = A \times B$ proceeds by iterating over block columns of A and block rows of B:4
- Outer Loop: The algorithm iterates $n/b$ times, where $n$ is the inner dimension of the matrices and $b$ is the block size (panel width).
- Broadcast and Compute: In each iteration, say the $k$-th one:
  - The processors owning the $k$-th block column of A (a panel of width $b$) broadcast their data horizontally across their respective processor rows.
  - Simultaneously, the processors owning the $k$-th block row of B broadcast their data vertically down their respective processor columns.
  - Each processor now has the necessary blocks of A and B. It performs a local matrix multiplication of the received blocks and adds the result to its local block of C.

This process is repeated for all block columns of A and block rows of B. The result matrix C remains stationary, while the relevant panels of A and B are broadcast to all processors that need them for the current rank-$b$ update.3
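Viewed serially, each SUMMA iteration is simply a rank-$b$ update of C. A short NumPy sketch (the matrix size and panel width b are illustrative choices) makes that structure explicit:

```python
import numpy as np

n, b = 8, 2
rng = np.random.default_rng(1)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

C = np.zeros((n, n))
for k in range(0, n, b):
    # Panel of b columns of A times panel of b rows of B: one rank-b update of C.
    C += A[:, k:k+b] @ B[k:k+b, :]

assert np.allclose(C, A @ B)
```

In the parallel algorithm, the two slices in the loop body are exactly the panels that get broadcast along processor rows and columns.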

Credit: Stack Overflow, user ‘Marius’
Algorithm: SUMMA(A[$n \times n$], B[$n \times n$], $p_{rows} \times p_{cols}$ processor grid)
Input: Matrices A and B of size $n \times n$, processor grid $p_{rows} \times p_{cols}$
Output: Matrix $C = A \times B$

block_size = $n / \sqrt{p}$
num_panels = $n$ / block_size

// Initialize result blocks
parallel for each processor $P(i, j)$ do
  C_local[i][j] = 0
end parallel

// Iterate over panels and accumulate partial products
for k = 0 to num_panels - 1 do
  parallel for each processor $P(i, j)$ do
    // Broadcast k-th column of A horizontally across row i
    if j == owner_of_column_k then
      A_panel = A_local[i][k]
    end
    A_panel = broadcast_along_row(A_panel, row=i)

    // Broadcast k-th row of B vertically down column j
    if i == owner_of_row_k then
      B_panel = B_local[k][j]
    end
    B_panel = broadcast_along_column(B_panel, column=j)

    // Local multiplication and accumulation
    C_local[i][j] = C_local[i][j] + (A_panel $\times$ B_panel)
  end parallel
end
return C assembled from all C_local blocks

Example for $2 \times 2$ processors:
- Iteration 0: Broadcast column 0 of A, row 0 of B
  - $P(0,0)$ computes $C(0,0) = C(0,0) + A(0,0) \times B(0,0)$; $P(0,1)$ computes $C(0,1) = C(0,1) + A(0,0) \times B(0,1)$
  - $P(1,0)$ computes $C(1,0) = C(1,0) + A(1,0) \times B(0,0)$; $P(1,1)$ computes $C(1,1) = C(1,1) + A(1,0) \times B(0,1)$
- Iteration 1: Broadcast column 1 of A, row 1 of B
- Continue accumulating partial products
Time Complexity:
- Computation: $O(n^3 / p)$
- Communication: $n/b$ pairs of row and column broadcasts, where $n$ is the matrix dimension and $b$ is the panel width
- Memory per processor: $O(n^2 / p)$
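For comparison with the Cannon sketch above, here is a minimal single-process NumPy simulation of SUMMA on a square grid (the grid shape and panel width are illustrative assumptions; in a real implementation the "broadcasts" would be collectives on MPI row and column communicators):

```python
import numpy as np

def summa_matmul(A, B, q):
    """Simulate SUMMA on a q x q block grid with panel width n // q (single process)."""
    n = A.shape[0]
    assert n % q == 0, "n must be divisible by the grid dimension q"
    b = n // q  # block / panel size

    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]

    for k in range(q):
        # "Broadcast" the k-th block column of A along each processor row
        # and the k-th block row of B down each processor column.
        A_panels = [A[i*b:(i+1)*b, k*b:(k+1)*b] for i in range(q)]  # one per row i
        B_panels = [B[k*b:(k+1)*b, j*b:(j+1)*b] for j in range(q)]  # one per column j

        # Every processor (i, j) applies the rank-b update with the panels it received.
        for i in range(q):
            for j in range(q):
                Cb[i][j] += A_panels[i] @ B_panels[j]

    return np.block(Cb)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((12, 12))
    B = rng.standard_normal((12, 12))
    assert np.allclose(summa_matmul(A, B, q=3), A @ B)
```

Note how C stays stationary (each Cb[i][j] is only ever accumulated into), while the A and B panels are the data that move each iteration.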
Performance Analysis and Comparison
The two algorithms embody different approaches to data orchestration. Cannon’s algorithm is a “roll-roll-compute” method, where data blocks are systematically rolled through a static processor grid via nearest-neighbor communication.3 It is highly synchronous and spatially oriented. SUMMA is a “broadcast-broadcast-compute” method that replicates the currently needed data panels across processor rows and columns, while the result matrix C remains stationary.3
- Communication: Cannon’s algorithm relies on nearest-neighbor shift operations. SUMMA relies on broadcast operations within processor rows and columns.5 While broadcasts may seem more expensive, they can be efficiently pipelined on modern interconnects. SUMMA’s pattern can often utilize network links more effectively than the strictly 2D communication of Cannon’s algorithm.5
- Flexibility: SUMMA is significantly more flexible. It does not require a square processor grid or a torus topology. It also handles rectangular matrices and different block sizes, which has made it the de facto standard in production linear algebra libraries.3
- Memory: Both are “2D” algorithms, meaning they distribute a single copy of the matrices over a 2D processor grid and have a memory footprint of $O(n^2 / p)$ per processor. Under this constraint, their communication costs are asymptotically optimal.5
This analysis leads to a deeper principle in modern HPC: trading memory for communication. The existence of “2.5D” and “3D” matrix multiplication algorithms demonstrates this trade-off. A 3D algorithm, for instance, might arrange the processors in a $p^{1/3} \times p^{1/3} \times p^{1/3}$ cube and replicate the input matrices $p^{1/3}$ times along the third dimension. This extra memory usage allows much of the required data to be local, dramatically reducing the volume of data that must be communicated over the network. Using $c$ copies of the data can reduce the communication bandwidth cost by a factor of $\sqrt{c}$ and the latency cost by a factor of $c^{3/2}$.6 This shows that the notion of an “optimal” algorithm is not absolute but is contingent on available resources. While 2D algorithms like Cannon’s and SUMMA are optimal under minimal memory constraints, communication-avoiding algorithms like the 2.5D/3D variants can achieve superior performance if additional memory is available.
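As a rough summary of the asymptotic per-processor costs, following the 2.5D analysis cited above (with $c$ the replication factor, $1 \le c \le p^{1/3}$):

$$
W_{2D} = O\!\left(\frac{n^2}{\sqrt{p}}\right), \qquad
W_{2.5D} = O\!\left(\frac{n^2}{\sqrt{c\,p}}\right), \qquad
S_{2D} = O\!\left(\sqrt{p}\right), \qquad
S_{2.5D} = O\!\left(\sqrt{\frac{p}{c^{3}}}\right),
$$

where $W$ counts words communicated (bandwidth) and $S$ counts messages (latency). Setting $c = 1$ recovers the 2D algorithms, and $c = p^{1/3}$ recovers the fully replicated 3D algorithm.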
References
Footnotes

1. Lecture 6: Parallel Matrix Algorithms (part 3), accessed October 6, 2025, https://www3.nd.edu/~zxu2/acms60212-40212-S12/Lec-07-3.pdf
2. Cannon’s algorithm for distributed matrix multiplication – OpenGenus IQ, accessed October 6, 2025, https://iq.opengenus.org/cannon-algorithm-distributed-matrix-multiplication/
3. Parallel Matrix Multiplication: A Systematic Journey, accessed October 6, 2025, https://www.cs.utexas.edu/flame/pubs/SUMMA2d3dTOMS.pdf
4. CS 140 Homework 3: SUMMA Matrix Multiplication – UCSB Computer Science, accessed October 6, 2025, https://sites.cs.ucsb.edu/gilbert/cs140/old/cs140Win2009/assignments/hw3.pdf
5. Matrix multiplication on multidimensional torus networks – UC Berkeley EECS, accessed October 6, 2025, http://eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-28.pdf
6. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms – The Netlib, accessed October 6, 2025, https://www.netlib.org/lapack/lawnspdf/lawn248.pdf