
9.3 Mapping Constructs to Hardware: An Example

The theoretical concepts of programming models, compilers, and runtimes become tangible when we trace the path of a specific algorithm from source code to execution on physical hardware. By examining how a canonical parallel problem—matrix multiplication—is implemented on two distinct architectures, a multi-core CPU and a many-core GPU, we can illuminate the profound influence of hardware design on software strategy. The most efficient parallelization approach is not an intrinsic property of the algorithm but is instead dictated by the architectural realities of the target machine.

9.3.1 The Canonical Problem: Matrix Multiplication


The multiplication of two matrices, $C = A \times B$, is a cornerstone of linear algebra and a fundamental operation in countless scientific and engineering domains [1]. The computation of each element $C_{i,j}$ involves the dot product of the i-th row of matrix A and the j-th column of matrix B [2]:

$$
C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}
$$

This algorithm has a computational complexity of $O(N^3)$ for $N \times N$ matrices, making it computationally intensive and a prime candidate for parallelization [3]. Critically, the calculation for each element $C_{i,j}$ is completely independent of the calculation for any other element. This property makes matrix multiplication an “embarrassingly parallel” problem, providing a clear and ideal test case for demonstrating parallel implementation techniques [2].
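
To make the scale concrete, a brief back-of-the-envelope count helps (the value N = 1024 below is chosen purely for illustration and does not come from the sources):

$$
2N^3 = 2 \times 1024^3 \approx 2.1 \times 10^9 \ \text{floating-point operations} \quad (N^3 \ \text{multiplications and } N^3 \ \text{additions}),
$$

while the $N^2 \approx 10^6$ output elements can all be computed independently of one another.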

9.3.2 Mapping to a Multi-Core CPU with OpenMP


For a typical shared-memory multi-core CPU, OpenMP provides a straightforward and efficient path to parallelization.

A standard C++ implementation of matrix multiplication involves three nested loops. Parallelizing this with OpenMP is remarkably simple: a single directive, #pragma omp parallel for, is placed before the outermost loop. This instructs the compiler to divide the iterations of the i loop, which correspond to the rows of the output matrix C, among a team of parallel threads [1].

C++

// C, A, and B are N x N matrices
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++) {
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}

The journey from this pragma to hardware execution involves a precise collaboration between the compiler, the OpenMP runtime, and the operating system:

  1. Compiler Transformation: Upon seeing the #pragma omp parallel for, the compiler transforms the code. It generates a function containing the loop body and inserts calls to the OpenMP runtime library to manage the parallel region [4].
  2. Runtime Thread Creation: When the program is executed, the OpenMP runtime creates a team of software threads. By default, this is often one thread per available CPU core [5].
  3. Thread-to-Core Mapping (Affinity): The operating system scheduler, in conjunction with the OpenMP runtime, maps these software threads onto the physical CPU cores. This mapping, or thread affinity, is critical for performance. Programmers can exert control over this process using environment variables [6]. OMP_PLACES defines the set of available hardware resources (e.g., cores, sockets), while OMP_PROC_BIND specifies the placement policy. For example, OMP_PROC_BIND=close will pack threads onto cores within a single socket before using another, which can be beneficial for applications with high data sharing. Conversely, OMP_PROC_BIND=spread will distribute threads as widely as possible across all sockets and cores, which can maximize memory bandwidth for applications where threads are more independent [7]. The short sketch after this list shows how a program can report where its threads actually land.
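
As a way to observe these placement policies, the following minimal sketch (not part of the original example; the reporting logic is illustrative) has every OpenMP thread print the place and hardware CPU it is running on. sched_getcpu() is a Linux/glibc call used here purely for demonstration.

C++

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu() (Linux/glibc)

int main() {
    #pragma omp parallel
    {
        int tid   = omp_get_thread_num();
        int place = omp_get_place_num();   // index into OMP_PLACES (OpenMP 4.5+)
        int cpu   = sched_getcpu();        // hardware CPU currently executing this thread
        #pragma omp critical
        std::printf("thread %d -> place %d, cpu %d\n", tid, place, cpu);
    }
    return 0;
}

Running the same binary with OMP_PROC_BIND=close and then OMP_PROC_BIND=spread (with OMP_PLACES=cores) makes the packed versus distributed placements directly visible.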

Once threads are running on cores, their performance is governed by the intricate behavior of the CPU’s memory subsystem. The programmer’s view of memory as a simple, flat, shared space is a powerful abstraction, but it is layered upon a complex physical reality.

  • Cache Coherence: Each CPU core has its own private L1 and L2 caches to reduce memory access latency [8]. When multiple threads on different cores compute rows of the matrix C, they are all writing to a shared data structure. This creates a cache coherence problem: a core’s private cache might hold a “stale” copy of a memory location that has been updated by another core [9]. To solve this, multi-core CPUs implement a hardware cache coherence protocol, most commonly MESI (Modified, Exclusive, Shared, Invalid) or a variant [8]. In this protocol, caches “snoop” on a shared bus. When one core writes to a cache line, it broadcasts an invalidation signal, forcing other cores that have a copy of that line to mark it as invalid. The next time another core needs that data, it will experience a cache miss and fetch the updated version from memory or directly from the modifying core’s cache [9]. This hardware mechanism is what makes the shared-memory illusion work, but it comes with communication overhead that can impact performance, especially if threads write to nearby memory locations that fall on the same cache line, a phenomenon known as false sharing [8]. The short sketch after this list illustrates the effect.
  • Non-Uniform Memory Access (NUMA): In systems with multiple processor sockets, the shared memory is physically distributed. A CPU core can access the memory bank attached to its own socket (local memory) much faster than it can access memory attached to another socket (remote memory) [10]. This creates a Non-Uniform Memory Access architecture. If a thread running on a core in Socket 0 is assigned to compute a row of C whose data happens to have been allocated in the memory of Socket 1, every memory access will incur the higher latency of traversing the inter-socket interconnect [10]. NUMA-aware scheduling, managed by the OS and guided by OpenMP affinity settings, is therefore crucial for performance. By ensuring that threads and the data they operate on are co-located on the same NUMA node (e.g., using OMP_PROC_BIND=close), these remote access penalties can be minimized, allowing the application to scale effectively [11].
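
To make false sharing concrete, the sketch below (not drawn from the sources; it assumes a 64-byte cache line, eight threads, and uses volatile so the stores stay in the loop) times the same per-thread increments with the counters packed together and then padded to separate cache lines. On a typical multi-core CPU the packed version is markedly slower, even though the arithmetic is identical.

C++

#include <cstdio>
#include <omp.h>

constexpr int  kNumThreads = 8;            // illustrative thread count
constexpr long kIterations = 100000000;    // enough work to make the difference visible

// Counters packed together: several of them share one 64-byte cache line, so a write
// by one core invalidates the line in every other core's private cache.
volatile long packed[kNumThreads];

// Each counter aligned to its own (assumed) 64-byte cache line, so each core keeps
// its line in the Modified state without interference.
struct alignas(64) PaddedCounter { volatile long value; };
PaddedCounter padded[kNumThreads];

int main() {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(kNumThreads)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < kIterations; i++) packed[tid] = packed[tid] + 1;              // false sharing
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(kNumThreads)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < kIterations; i++) padded[tid].value = padded[tid].value + 1;  // no false sharing
    }
    double t2 = omp_get_wtime();
    std::printf("packed: %.3fs  padded: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}

The only difference between the two timed regions is memory layout; the coherence traffic generated by the packed layout accounts for the gap.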

9.3.3 Mapping to a Many-Core GPU with CUDA

Mapping the same matrix multiplication problem to a GPU reveals a completely different architectural philosophy and, consequently, a different parallelization strategy.

A CUDA implementation consists of two parts: host code running on the CPU and a kernel running on the GPU. The host is responsible for allocating memory on both the host and device, transferring the input matrices A and B to the GPU, launching the kernel, and transferring the result matrix C back from the GPU [12].

C++

// Host Code Snippet
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
dim3 threadsPerBlock(16, 16);
// Round up so the grid covers all of C even when N is not a multiple of 16
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Device (Kernel) Code
__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
    // Each thread computes exactly one element of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
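
One practical detail the snippet above leaves out: a kernel launch returns no error status of its own, and each CUDA API call returns a cudaError_t that the example ignores. A hedged continuation of the host code (illustrative, not from the sources; assumes <cstdio> is included) checks the launch and waits for the kernel to complete:

C++

// Check for launch-configuration errors, then wait for the kernel to finish.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
}
err = cudaDeviceSynchronize();   // also surfaces errors raised during kernel execution
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
}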

The CUDA mapping is hierarchical and explicitly defined by the programmer, translating directly to GPU hardware structures [13]:

  1. Grid, Blocks, and Threads: The programmer organizes the computation into a Grid of Thread Blocks. In this case, a 2D grid of 2D blocks is natural. Each thread is assigned the task of computing a single element of the output matrix C [14].
  2. Block-to-SM Mapping: The CUDA runtime is responsible for distributing the Thread Blocks from the grid onto the GPU’s Streaming Multiprocessors (SMs). An SM is the core processing unit of the GPU. Multiple blocks can be resident on a single SM at a time, and the runtime schedules them for execution as resources become available [12].
  3. Thread-to-Warp-to-Core Mapping: Within an SM, threads are grouped by the hardware into Warps (typically 32 threads) [12]. The SM’s Warp Scheduler is the key to its efficiency. It selects a warp whose threads are ready to execute and issues their next instruction to the SM’s CUDA cores. The execution model is Single Instruction, Multiple Thread (SIMT): all 32 threads in a warp execute the same instruction at the same time, but on different data [12]. If a warp must wait for a slow memory access, the scheduler can instantly switch to another resident warp, hiding the memory latency and keeping the computational units busy. The short sketch after this list shows how a thread’s position within its block determines the warp it belongs to.
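
To make the thread-to-warp grouping tangible, the kernel below is an illustrative sketch (not from the sources; it assumes the 32-thread warp size mentioned above). It linearizes each thread's 2D index within its block, which is the order the hardware uses when forming warps, and has the first thread of each warp report where that warp begins.

C++

#include <cstdio>

__global__ void whichWarp() {
    // Linearize the 2D thread index within the block (x varies fastest).
    int linear = threadIdx.y * blockDim.x + threadIdx.x;
    int warpId = linear / 32;   // 32 threads per warp, as noted in the text
    int lane   = linear % 32;   // this thread's position within its warp
    if (lane == 0) {
        printf("block (%d,%d): warp %d starts at linear thread %d\n",
               blockIdx.x, blockIdx.y, warpId, linear);
    }
}

int main() {
    dim3 threadsPerBlock(16, 16);    // same block shape as the matrix kernel: 256 threads
    whichWarp<<<dim3(1, 1), threadsPerBlock>>>();
    cudaDeviceSynchronize();         // flush device-side printf output
    return 0;
}

For the 16 x 16 blocks used above, each block holds 256 threads, i.e., eight warps.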

The GPU’s performance is inextricably linked to its explicit and non-uniform memory hierarchy. Unlike the CPU, which tries to hide this complexity, the GPU exposes it to the programmer, who must manage it skillfully to achieve high performance.

  • The GPU Memory Hierarchy: A thread has access to several distinct memory spaces [12]. It has its own private, ultra-fast registers. Threads within the same block can communicate and share data through a fast, on-chip shared memory, which acts as a programmer-managed cache. All threads across all blocks can access the large but high-latency global memory (the device’s DRAM) [15].
  • Optimization via Tiling: The naive kernel shown above is highly inefficient because each thread makes many accesses to slow global memory. The standard high-performance solution is a tiled or blocked algorithm [14]. In this approach, a thread block cooperatively loads small sub-matrices (tiles) of A and B from global memory into the fast shared memory. Then, all threads in the block perform the necessary multiplications for their output element using data exclusively from shared memory. This strategy drastically reduces global memory traffic, the primary performance bottleneck, by maximizing data reuse within the SM’s fastest memory space [14]. Effective CUDA programming is, in large part, the art of structuring algorithms to maximize computation on data held in registers and shared memory while minimizing global memory access. A sketch of such a tiled kernel follows this list.
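
The kernel below is a minimal sketch of this tiling idea (a standard formulation rather than code from the sources; it assumes 16 x 16 tiles and the same row-major layout as the naive kernel). Each thread block stages one tile of A and one tile of B in shared memory per step, synchronizes, and accumulates partial dot products entirely out of shared memory.

C++

#define TILE 16   // tile width, matching the 16 x 16 thread blocks used earlier

__global__ void matrixMulTiled(const float* A, const float* B, float* C, int N) {
    // Shared-memory staging buffers for one tile of A and one tile of B.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // Each thread loads one element of each tile (zero past the matrix edge).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();   // make sure the tiles are fully loaded before anyone reads them

        // Partial dot product using only shared memory.
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();   // finish reading before the next iteration overwrites the tiles
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

It can be launched with the same grid and block configuration as the naive kernel. Relative to that kernel, each thread now performs roughly N / TILE global loads per input matrix instead of N, cutting global memory traffic by about a factor of TILE.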

9.3.4 Synthesis: Contrasting CPU and GPU Mapping


The parallelization of matrix multiplication on a CPU versus a GPU reveals fundamental differences in their architectural philosophies. The CPU is optimized for low latency on a few complex tasks, while the GPU is optimized for high throughput on many simple tasks. This leads to entirely different mapping strategies, memory considerations, and optimization goals.

  • Unit of Parallelism: On the CPU, the primary unit of parallelism is a heavyweight thread, managed by the OS and runtime. A small number of these threads are mapped to powerful cores. On the GPU, parallelism comes from a massive number of lightweight threads, which are grouped and managed in hardware as warps.
  • Scheduling: The CPU relies on a complex, software-based scheduler (in the OS and runtime) that can perform sophisticated operations like preemption and work-stealing to ensure fairness and balance load. The GPU uses a simpler, hardware-based warp scheduler whose primary goal is to hide memory latency by rapidly switching between a large pool of available warps.
  • Memory Philosophy: The CPU architecture invests significant hardware complexity (large caches, sophisticated coherence protocols) to present the programmer with the simple illusion of a uniform, coherent shared memory space. The GPU architecture largely abandons this illusion, exposing its physical memory hierarchy to the programmer and demanding explicit data management to achieve performance.

These profound differences underscore the insight that an algorithm optimized for one architecture can be deeply inefficient on another. The “best” way to parallelize a problem is not a universal truth but a specific answer to the question: “What is the most effective way to map this computation onto the resources and constraints of this particular hardware?”

Feature | Multi-Core CPU (with OpenMP) | Many-Core GPU (with CUDA)
--- | --- | ---
Programming Construct | #pragma omp parallel for | __global__ kernel launch <<<…>>>
Logical Parallel Unit | Loop Iteration (e.g., a matrix row) | Single Data Element (e.g., a matrix cell)
Physical Execution Unit | OS Thread mapped to a CPU Core | CUDA Thread, executed in a Warp of 32
Scheduling | OS/Runtime scheduler (preemptive, work-stealing possible) | Hardware Warp Scheduler (non-preemptive, latency hiding)
Key Memory Challenge | Cache Coherence (MESI), NUMA locality | Global Memory Latency
Optimization Goal | Maximize cache hits, ensure NUMA-local access | Maximize use of Shared Memory, minimize global memory access

  1. Parallel Programming With OpenMP In C++ - Matrix Multiplication …, accessed October 7, 2025, https://www.c-sharpcorner.com/article/parallel-programming-with-openmp-in-cpp-matrix-multiplication-example/

  2. 5.2 Matrix Multiplication — Parallel Computing for Beginners - Free Hands-On Materials for Learning PDC, accessed October 7, 2025, https://www.learnpdc.org/PDCBeginners/5-applications/matrix-multiply.html

  3. Parallelized Blocked Matrix Multiplication using OpenMP | by Charith Pietersz - Medium, accessed October 7, 2025, https://medium.com/@cj.ptsz/parallelized-blocked-matrix-multiplication-using-openmp-97a4bc620a47

  4. Pragma directives for OpenMP parallelization - IBM, accessed October 7, 2025, https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=reference-pragma-directives-openmp-parallelization

  5. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html

  6. OpenMP thread mapping to physical cores - Stack Overflow, accessed October 7, 2025, https://stackoverflow.com/questions/4717251/openmp-thread-mapping-to-physical-cores

  7. Processor Affinity for OpenMP and MPI » ADMIN Magazine, accessed October 7, 2025, https://www.admin-magazine.com/HPC/Articles/Processor-Affinity-for-OpenMP-and-MPI

  8. Multicore processors and cache coherence | Intro to Computer Architecture Class Notes, accessed October 7, 2025, https://fiveable.me/introduction-computer-architecture/unit-7/multicore-processors-cache-coherence/study-guide/0Fr1h83bSS0NcUMH

  9. Cache Coherence: How the MESI Protocol Keeps Multi-Core CPUs …, accessed October 7, 2025, https://dev.to/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j

  10. Non-uniform memory access - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Non-uniform_memory_access

  11. Improving Parallel System Performance with a NUMA-aware Load Balancer - IDEALS, accessed October 7, 2025, https://www.ideals.illinois.edu/items/26080/bitstreams/89289/object?dl=1

  12. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8

  13. CUDA C++ Programming Guide - NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation

  14. Matrix Multiplication in CUDA - Harshit Kumar, accessed October 7, 2025, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda

  15. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf