9.3 Mapping Constructs to Hardware
The theoretical concepts of programming models, compilers, and runtimes become more concrete when tracing a specific algorithm’s path from source code to execution on physical hardware. By examining how a standard parallel problem—matrix multiplication—is implemented on two distinct architectures (a multi-core CPU and a many-core GPU), we can highlight the significant influence of hardware design on software strategy. The most efficient parallelization approach is not an inherent property of the algorithm but is determined by the architectural characteristics of the target machine.
9.3.1 The Canonical Problem: Matrix Multiplication
The multiplication of two matrices, $C = A \times B$, is a fundamental operation in linear algebra and is widely used in scientific and engineering domains.1 The computation of each element $C_{ij}$ involves the dot product of the $i$-th row of matrix A and the $j$-th column of matrix B:2

$$C_{ij} = \sum_{k=1}^{N} A_{ik} \, B_{kj}$$
Embarrassingly Parallel: The calculation of each element $C_{ij}$ is independent of the calculation of every other element. This property makes matrix multiplication an ideal test case for demonstrating parallel implementation techniques.2
Key Characteristics:
| Property | Value | Implication |
|---|---|---|
| Computational Complexity | $O(N^3)$ for $N \times N$ matrices3 | Computationally intensive; significant parallelization benefit |
| Data Independence | Each $C_{ij}$ calculated independently | Perfect for parallel execution |
| Parallelization Potential | “Embarrassingly parallel”2 | Multiple viable parallelization strategies across architectures |
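Both implementations that follow start from the same serial triple loop, shown here as a reference point; it computes each $C_{ij}$ exactly as the dot-product formula above describes. This is a minimal sketch using flattened row-major storage (matching the later CUDA version; the OpenMP example uses 2D array indexing instead), with initialization and error handling omitted.

```cpp
#include <vector>

// Serial reference: C = A * B for N x N matrices stored row-major.
// Each C[i][j] is the dot product of row i of A and column j of B.
void matmul_serial(const std::vector<double>& A,
                   const std::vector<double>& B,
                   std::vector<double>& C, int N) {
    for (int i = 0; i < N; i++) {          // row of C
        for (int j = 0; j < N; j++) {      // column of C
            double sum = 0.0;
            for (int k = 0; k < N; k++) {  // dot product: N multiply-adds
                sum += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}
```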
9.3.2 Mapping to a Multi-Core CPU with OpenMP
For a standard shared-memory multi-core CPU, OpenMP offers a direct and efficient method for parallelization.
The Code
A standard C++ implementation of matrix multiplication uses three nested loops. Parallelizing this with OpenMP is straightforward: a single directive, #pragma omp parallel for, is placed before the outermost loop. This instructs the compiler to distribute the iterations of the i loop—which correspond to the rows of the output matrix C—among a team of parallel threads.1
```cpp
// C, A, and B are N x N matrices
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++) {
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}
```

Logical-to-Physical Mapping
The process from this pragma to hardware execution requires precise collaboration between the compiler, the OpenMP runtime, and the operating system:
| Stage | Component | Action | Result |
|---|---|---|---|
| 1. Compile-Time | Compiler | Transforms code upon encountering #pragma omp parallel for4 | Generates function for loop body; inserts OpenMP runtime calls |
| 2. Runtime Initialization | OpenMP Runtime | Creates team of software threads at program execution5 | Typically one thread per available CPU core |
| 3. Thread-to-Core Mapping | OS Scheduler + OpenMP Runtime | Maps software threads to physical CPU cores using affinity policies6 | Threads bound to specific cores based on policy |
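To make stages 1 and 2 more tangible, the sketch below hand-codes roughly what the compiler and runtime arrange on the programmer's behalf: the loop body is outlined into a function and the row range is split across a team of threads. This is a conceptual illustration only, written with std::thread; a real OpenMP compiler instead emits calls into its own runtime library, and the default loop schedule is implementation-defined (a static split like this one is merely common).

```cpp
#include <thread>
#include <vector>

// Conceptual illustration of "#pragma omp parallel for": outline the loop
// body and divide the i loop (rows of C) among a team of threads.
void parallel_rows(int N, int num_threads,
                   const std::vector<double>& A,
                   const std::vector<double>& B,
                   std::vector<double>& C) {
    std::vector<std::thread> team;
    for (int t = 0; t < num_threads; t++) {
        // Contiguous chunk of rows per thread, similar to a static schedule.
        int begin = (N * t) / num_threads;
        int end   = (N * (t + 1)) / num_threads;
        team.emplace_back([=, &A, &B, &C]() {
            for (int i = begin; i < end; i++) {
                for (int j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++) {
                        sum += A[i * N + k] * B[k * N + j];
                    }
                    C[i * N + j] = sum;
                }
            }
        });
    }
    for (auto& th : team) th.join();  // analogous to the implicit barrier at the end of the parallel region
}
```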
Thread Affinity: The mapping of software threads to physical CPU cores, critical for performance. Programmers control this via environment variables: OMP_PLACES defines the available hardware resources (cores, sockets), and OMP_PROC_BIND specifies the placement policy.67
Thread Affinity Policies:
| Policy | Behavior | Best For | Example Use Case |
|---|---|---|---|
| close | Pack threads onto cores within a single socket before using another7 | Applications with high data sharing | Threads frequently access same cache lines |
| spread | Distribute threads as widely as possible across all sockets and cores7 | Applications with independent threads | Maximize aggregate memory bandwidth |
| master | All threads bind to the same place as the master (primary) thread | NUMA-aware applications | Minimize remote access to data allocated by the master thread |
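The resulting placement can be inspected at runtime through the OpenMP API. The short program below assumes an OpenMP 4.5+ compiler, so that omp_get_place_num() and omp_get_num_places() are available, and prints which place each thread is bound to; running it under OMP_PROC_BIND=close and then OMP_PROC_BIND=spread makes the difference between the two policies visible.

```cpp
#include <cstdio>
#include <omp.h>

// Prints which "place" (as defined by OMP_PLACES) each OpenMP thread is
// bound to, so the effect of OMP_PROC_BIND can be observed directly.
int main() {
    #pragma omp parallel
    {
        #pragma omp critical
        printf("Thread %d of %d is bound to place %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}
```

Compile with the compiler's OpenMP flag (for example, g++ -fopenmp) and set OMP_PLACES=cores before running.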
Hardware Considerations in Execution
Once threads are executing on cores, their performance is determined by the complex behavior of the CPU’s memory subsystem.
The Memory Abstraction Gap: The programmer’s view of memory as a simple, flat, shared space is a useful abstraction, but it is built upon a complex physical reality involving caches, coherence protocols, and non-uniform memory access.
Cache Coherence
Each CPU core has private L1 and L2 caches to minimize memory access latency.8 When multiple threads on different cores compute rows of matrix C, they all write to a shared data structure, introducing a cache coherence problem.
The Coherence Problem: A core’s private cache may hold a “stale” copy of a memory location that has been updated by another core. Multi-core CPUs implement hardware cache coherence protocols (typically MESI: Modified, Exclusive, Shared, Invalid) to maintain consistency.89
MESI Protocol Operation:
- Caches “snoop” on a shared bus
- When one core writes to a cache line, it broadcasts an invalidation signal
- Other cores with that line mark their copies as invalid
- Next access by another core causes cache miss → fetch updated version
False Sharing: A performance hazard where threads write to nearby memory locations on the same cache line, causing excessive invalidation traffic despite no logical data sharing.8
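False sharing is easy to produce unintentionally, for example by giving each thread its own slot in a densely packed array of accumulators. The sketch below contrasts that layout with one where each accumulator is padded to its own cache line; the 64-byte line size is an assumption typical of current x86 CPUs, not something the language guarantees, and the padded variant relies on C++17 for over-aligned allocation in std::vector.

```cpp
#include <omp.h>
#include <vector>

// Per-thread partial sums packed tightly: adjacent doubles likely share a
// 64-byte cache line, so each update by one thread invalidates the line in
// other cores' caches (false sharing). partial must hold num_threads slots.
void accumulate_false_sharing(const std::vector<double>& data,
                              std::vector<double>& partial, int num_threads) {
    #pragma omp parallel num_threads(num_threads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < (long)data.size(); i++) {
            partial[t] += data[i];   // neighboring partial[t] values share cache lines
        }
    }
}

// Padding each accumulator to its own (assumed 64-byte) cache line removes
// the invalidation traffic without changing the logic.
struct alignas(64) PaddedSum {
    double value = 0.0;
};

void accumulate_padded(const std::vector<double>& data,
                       std::vector<PaddedSum>& partial, int num_threads) {
    #pragma omp parallel num_threads(num_threads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < (long)data.size(); i++) {
            partial[t].value += data[i];  // each thread writes its own cache line
        }
    }
}
```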
Non-Uniform Memory Access (NUMA)
In multi-socket systems, shared memory is physically distributed.
| Memory Type | Access Latency | Scenario |
|---|---|---|
| Local Memory | Low (fast) | Core accesses memory bank on its own socket10 |
| Remote Memory | High (slow) | Core accesses memory bank on different socket10 |
NUMA Impact: If a thread on Socket 0 computes a row of C whose data is allocated in Socket 1’s memory, every access incurs the higher latency of traversing the inter-socket interconnect. NUMA-aware scheduling (e.g., OMP_PROC_BIND=close) co-locates threads and data on the same NUMA node, minimizing remote access penalties.1011
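Data placement matters as much as thread placement. On most Linux systems a page is physically allocated on the NUMA node of the thread that first touches it, so initializing the matrices with the same parallel loop structure as the computation tends to put each row of A and C on the node of the thread that will later use it. The sketch below relies on that first-touch policy as an OS assumption, and it only helps if OMP_PROC_BIND keeps threads pinned between initialization and computation.

```cpp
#include <omp.h>

// First-touch initialization: pages of A, B, and C are faulted in by the
// same threads (and thus, under a first-touch OS policy, the same NUMA
// nodes) that will later compute those rows in the parallelized loop.
void numa_aware_init(double* A, double* B, double* C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i * N + j] = 1.0;
            B[i * N + j] = 2.0;
            C[i * N + j] = 0.0;
        }
    }
}
```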
9.3.3 Mapping to a GPU with CUDA
Mapping the same matrix multiplication problem to a GPU demonstrates a completely different architectural philosophy and, as a result, a different parallelization strategy.
The Code
A CUDA implementation includes two parts: host code that runs on the CPU and a kernel that runs on the GPU. The host is responsible for allocating memory on both the host and the device, transferring the input matrices A and B to the GPU, launching the kernel, and transferring the resulting matrix C back from the GPU.12
```cpp
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Naive kernel: each thread computes one element of C.
__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 1024;
    size_t size = N * N * sizeof(float);

    // Host allocations and initialization.
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Device allocations.
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy inputs host -> device.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 16x16 threads per block; enough blocks to cover the N x N output.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result device -> host (waits for the kernel to finish).
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("Result: C[0][0] = %.2f\n", h_C[0]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}
```

Compilation:

```bash
nvcc -o matrix_mul matrix_mul.cu
./matrix_mul
```

Logical-to-Physical Mapping
The CUDA mapping is hierarchical and explicitly defined by the programmer, which translates directly to GPU hardware structures:13
| Hierarchy Level | Logical Construct | Physical Hardware | Mapping Details |
|---|---|---|---|
| 1. Grid | Grid of Thread Blocks | Entire GPU | Programmer organizes computation into 2D/3D grid; each thread computes one output element14 |
| 2. Block | Thread Block | Streaming Multiprocessor (SM) | CUDA runtime distributes blocks to SMs; multiple blocks can reside on one SM12 |
| 3. Warp | Group of 32 Threads | Warp Scheduler + CUDA Cores | Hardware groups threads into warps; SM’s Warp Scheduler issues instructions12 |
Single Instruction, Multiple Thread (SIMT): All 32 threads in a warp execute the same instruction simultaneously but on different data. If a warp waits for memory, the scheduler instantly switches to another resident warp, hiding latency and keeping computational units busy.12
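To make the warp grouping concrete, the kernel sketch below computes each thread's warp index and lane within its block from the built-in threadIdx and warpSize values; a 1D block is assumed for simplicity.

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrates how consecutive threads of a 1D block are grouped into warps:
// threads 0-31 form warp 0, threads 32-63 form warp 1, and so on.
__global__ void showWarps() {
    int lane = threadIdx.x % warpSize;   // position within the warp (0-31)
    int warp = threadIdx.x / warpSize;   // warp index within the block
    if (lane == 0) {                     // one printf per warp
        printf("Block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
    }
}

int main() {
    showWarps<<<2, 128>>>();             // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();             // wait so device-side printf output is flushed
    return 0;
}
```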
CUDA Execution Hierarchy:
```
Grid (entire GPU)
└── Thread Blocks (mapped to SMs)
    └── Warps (32 threads, scheduled by SM)
        └── Individual Threads (execute on CUDA cores)
```

Key GPU Architecture Concepts:
| Component | Role | Count/Capability |
|---|---|---|
| Streaming Multiprocessor (SM) | Core processing unit of GPU12 | Multiple per GPU; runs multiple blocks concurrently |
| Warp | Hardware thread grouping12 | 32 threads; execute in lockstep (SIMT) |
| Warp Scheduler | Selects warps for execution12 | Hides memory latency via rapid context switching |
| CUDA Core | Arithmetic logic unit | Typically tens to over a hundred per SM (architecture-dependent); execute thread instructions |
Hardware Considerations in Execution
The GPU’s performance is closely tied to its explicit and non-uniform memory hierarchy.
CPU vs. GPU Memory Philosophy: The CPU attempts to hide memory complexity through automatic caching. The GPU exposes its memory hierarchy to the programmer, who must manage it explicitly to achieve high performance.1512
GPU Memory Hierarchy:
| Memory Type | Scope | Speed | Size | Programmer Control | Typical Use |
|---|---|---|---|---|---|
| Registers | Per-thread private | Fastest | Very small | Automatic (compiler-managed) | Loop variables, temporaries |
| Shared Memory | Per-block shared | Very fast (on-chip) | Small (KB per SM) | Explicit | Cooperative data sharing; programmer-managed cache12 |
| Global Memory | All threads | Slow (high latency) | Large (GB DRAM)15 | Explicit | Input/output data; main data storage |
| Constant Memory | Read-only, all threads | Fast (cached) | Small | Explicit | Read-only lookup tables |
| Texture Memory | Read-only, all threads | Fast (cached, optimized for 2D locality) | Large | Explicit | Image processing |
The Primary GPU Bottleneck: Global memory access. The naive kernel performs numerous accesses to slow global memory, making it highly inefficient. The solution is tiling.14
Optimization via Tiling:
- Problem: Each thread performs many slow global memory accesses
- Solution: Thread block cooperatively loads sub-matrices (tiles) of A and B from global memory into fast shared memory
- Benefit: Threads compute using data exclusively from shared memory
- Result: Dramatically reduced global memory traffic; maximized data reuse in fast memory14
Effective CUDA Programming: Structure algorithms to maximize computation on data in registers and shared memory while minimizing global memory access. This optimization strategy defines the difference between naive and high-performance GPU code.14
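A common shape for such a tiled kernel is sketched below. TILE is a tunable constant, the sketch assumes N is a multiple of TILE, and bounds checks and error handling are omitted; it is meant to illustrate the cooperative load, synchronize, and compute pattern rather than serve as a production kernel. Launching it with dim3(TILE, TILE) threads per block and dim3(N/TILE, N/TILE) blocks would replace the naive kernel launch in the host code above.

```cpp
#define TILE 16

// Tiled matrix multiplication: each block computes a TILE x TILE sub-block
// of C. At each step, the block cooperatively loads one tile of A and one
// tile of B from global memory into shared memory, synchronizes, and then
// accumulates partial dot products entirely from fast on-chip memory.
__global__ void matrixMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of the current A tile and B tile.
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                     // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; k++) {     // inner product over the tile
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();                     // wait before the tiles are overwritten
    }
    C[row * N + col] = sum;
}
```

Compared with the naive kernel, each element of A and B is now read from global memory once per tile rather than once per thread, which is exactly the reduction in global memory traffic the list above describes.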
9.3.4 Synthesis: Contrasting CPU and GPU Mapping
The parallelization of matrix multiplication on a CPU versus a GPU highlights fundamental differences in their architectural philosophies.
Architectural Optimization Goals: The CPU is optimized for low-latency execution of a few complex tasks. The GPU is optimized for high-throughput execution of many simple tasks. This fundamental difference drives entirely different mapping strategies, memory considerations, and optimization goals.
Key Architectural Contrasts:
| Dimension | Multi-Core CPU | Many-Core GPU |
|---|---|---|
| Unit of Parallelism | Heavyweight thread (OS/runtime-managed); small number mapped to powerful cores | Lightweight thread (hardware-managed in warps); thousands of threads |
| Scheduling Approach | Complex software-based (OS + runtime); sophisticated preemption and work-stealing for fairness/load balancing | Simple hardware-based warp scheduler; rapid context switching to hide memory latency |
| Memory Philosophy | Hardware complexity (large caches, coherence protocols) provides illusion of uniform, coherent shared memory | Exposes physical memory hierarchy; requires explicit programmer management for performance |
| Optimization Focus | Maximize cache hits; ensure NUMA-local access; minimize synchronization overhead | Maximize shared memory use; minimize global memory access; maximize warp occupancy |
| Parallelism Granularity | Coarse (rows/chunks of matrix) | Fine (individual matrix elements) |
The Architecture-Algorithm Dependency: An algorithm optimized for one architecture can be highly inefficient on another. The optimal parallelization method is not universal but architecture-specific—it answers: “What is the most effective way to map this computation onto the resources and constraints of this particular hardware?”
These contrasts demonstrate why portable performance across diverse hardware remains a fundamental challenge in parallel computing. The same algorithm requires radically different implementations to achieve efficiency on different architectures.
| Feature | Multi-Core CPU (with OpenMP) | Many-Core GPU (with CUDA) |
|---|---|---|
| Programming Construct | #pragma omp parallel for | __global__ kernel launch <<<...>>> |
| Logical Parallel Unit | Loop Iteration (e.g., a matrix row) | Single Data Element (e.g., a matrix cell) |
| Physical Execution Unit | OS Thread mapped to a CPU Core | CUDA Thread, executed in a Warp of 32 |
| Scheduling | OS/Runtime scheduler (preemptive, work-stealing possible) | Hardware Warp Scheduler (non-preemptive, latency hiding) |
| Key Memory Challenge | Cache Coherence (MESI), NUMA locality | Global Memory Latency |
| Optimization Goal | Maximize cache hits, ensure NUMA-local access | Maximize use of Shared Memory, minimize global memory access |
References
Footnotes

1. Parallel Programming With OpenMP In C++ - Matrix Multiplication …, accessed October 7, 2025, https://www.c-sharpcorner.com/article/parallel-programming-with-openmp-in-cpp-matrix-multiplication-example/
2. 5.2 Matrix Multiplication — Parallel Computing for Beginners - Free Hands-On Materials for Learning PDC, accessed October 7, 2025, https://www.learnpdc.org/PDCBeginners/5-applications/matrix-multiply.html
3. Parallelized Blocked Matrix Multiplication using OpenMP | by Charith Pietersz - Medium, accessed October 7, 2025, https://medium.com/@cj.ptsz/parallelized-blocked-matrix-multiplication-using-openmp-97a4bc620a47
4. Pragma directives for OpenMP parallelization - IBM, accessed October 7, 2025, https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=reference-pragma-directives-openmp-parallelization
5. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html
6. OpenMP thread mapping to physical cores - Stack Overflow, accessed October 7, 2025, https://stackoverflow.com/questions/4717251/openmp-thread-mapping-to-physical-cores
7. Processor Affinity for OpenMP and MPI » ADMIN Magazine, accessed October 7, 2025, https://www.admin-magazine.com/HPC/Articles/Processor-Affinity-for-OpenMP-and-MPI
8. Multicore processors and cache coherence | Intro to Computer Architecture Class Notes, accessed October 7, 2025, https://fiveable.me/introduction-computer-architecture/unit-7/multicore-processors-cache-coherence/study-guide/0Fr1h83bSS0NcUMH
9. Cache Coherence: How the MESI Protocol Keeps Multi-Core CPUs …, accessed October 7, 2025, https://dev.to/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j
10. Non-uniform memory access - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Non-uniform_memory_access
11. Improving Parallel System Performance with a NUMA-aware Load Balancer - IDEALS, accessed October 7, 2025, https://www.ideals.illinois.edu/items/26080/bitstreams/89289/object?dl=1
12. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8
13. CUDA C++ Programming Guide - NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
14. Matrix Multiplication in CUDA - Harshit Kumar, accessed October 7, 2025, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda
15. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf