9.3 Mapping Constructs to Hardware

The theoretical concepts of programming models, compilers, and runtimes become more concrete when tracing a specific algorithm’s path from source code to execution on physical hardware. By examining how a standard parallel problem—matrix multiplication—is implemented on two distinct architectures (a multi-core CPU and a many-core GPU), we can highlight the significant influence of hardware design on software strategy. The most efficient parallelization approach is not an inherent property of the algorithm but is determined by the architectural characteristics of the target machine.

9.3.1 The Canonical Problem: Matrix Multiplication

The multiplication of two matrices, $C = A \times B$, is a fundamental operation in linear algebra and is widely used in scientific and engineering domains.[1] The computation of each element $C_{i,j}$ involves the dot product of the $i$-th row of matrix A and the $j$-th column of matrix B [2]:

$$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}$$

Embarrassingly Parallel: The calculation of each element $C_{i,j}$ is independent of the calculation of any other element. This property makes matrix multiplication an ideal test case for demonstrating parallel implementation techniques.[2]
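
As a concrete illustration of the formula, consider $2 \times 2$ matrices (the values below are chosen purely for illustration):

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}, \quad C_{0,0} = A_{0,0}B_{0,0} + A_{0,1}B_{1,0} = 1 \cdot 5 + 2 \cdot 7 = 19$$

Each of the four entries of $C$ is produced by such a dot product, and none of them depends on any other.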

Key Characteristics:

| Property | Value | Implication |
| --- | --- | --- |
| Computational Complexity | $O(N^3)$ for $N \times N$ matrices [3] | Computationally intensive; significant parallelization benefit |
| Data Independence | Each $C_{i,j}$ calculated independently | Perfect for parallel execution |
| Parallelization Potential | "Embarrassingly parallel" [2] | Multiple viable parallelization strategies across architectures |
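
To give a sense of scale for the $O(N^3)$ cost (a standard back-of-the-envelope estimate, not a figure from the sources): the innermost loop performs one multiply and one add per iteration, so

$$\text{FLOPs} \approx 2N^3, \qquad N = 1024 \;\Rightarrow\; 2 \times 1024^3 \approx 2.1 \times 10^9$$

Even a modest matrix therefore involves billions of floating-point operations, which is why parallelization pays off.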

9.3.2 Mapping to a Multi-Core CPU with OpenMP

For a standard shared-memory multi-core CPU, OpenMP offers a direct and efficient method for parallelization.

A standard C++ implementation of matrix multiplication uses three nested loops. Parallelizing this with OpenMP is straightforward: a single directive, #pragma omp parallel for, is placed before the outermost loop. This instructs the compiler to distribute the iterations of the i loop (which correspond to the rows of the output matrix C) among a team of parallel threads.[1]

C++

// C, A, and B are N x N matrices
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++) {
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}
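
For reference, one typical way to build and run this OpenMP version with GCC is shown below (the file name and thread count are illustrative, not taken from the source):

g++ -O2 -fopenmp matrix_mul_omp.cpp -o matrix_mul_omp
OMP_NUM_THREADS=8 ./matrix_mul_omp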

The process from this pragma to hardware execution requires precise collaboration between the compiler, the OpenMP runtime, and the operating system:

| Stage | Component | Action | Result |
| --- | --- | --- | --- |
| 1. Compile-Time | Compiler | Transforms the code upon encountering #pragma omp parallel for [4] | Generates a function for the loop body; inserts OpenMP runtime calls |
| 2. Runtime Initialization | OpenMP Runtime | Creates a team of software threads at program execution [5] | Typically one thread per available CPU core |
| 3. Thread-to-Core Mapping | OS Scheduler + OpenMP Runtime | Maps software threads to physical CPU cores using affinity policies [6] | Threads bound to specific cores based on policy |
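
Conceptually, the compile-time transformation in stage 1 outlines the loop body into a separate function and hands row ranges to a team of threads. The sketch below imitates that behavior by hand with std::thread; it is an illustration of the idea only, not the actual code or runtime API that an OpenMP compiler generates:

C++

// Hand-rolled illustration of what "#pragma omp parallel for" arranges:
// the loop body is outlined and contiguous row ranges are assigned to a
// team of threads (static schedule). Matrices are row-major 1-D arrays.
#include <algorithm>
#include <thread>
#include <vector>

void outlined_rows(const double* A, const double* B, double* C, int N,
                   int i_begin, int i_end) {
    for (int i = i_begin; i < i_end; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

void parallel_matmul(const double* A, const double* B, double* C, int N,
                     int num_threads) {
    std::vector<std::thread> team;
    int chunk = (N + num_threads - 1) / num_threads;  // rows per thread
    for (int t = 0; t < num_threads; t++) {
        int begin = t * chunk;
        int end = std::min(N, begin + chunk);
        if (begin < end)
            team.emplace_back(outlined_rows, A, B, C, N, begin, end);
    }
    for (auto& th : team) th.join();  // implicit barrier at the end of the loop
}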

Thread Affinity: The mapping of software threads to physical CPU cores, critical for performance. Programmers control this via environment variables: OMP_PLACES defines the available hardware resources (cores, sockets); OMP_PROC_BIND specifies the placement policy.[6][7]

Thread Affinity Policies:

| Policy | Behavior | Best For | Example Use Case |
| --- | --- | --- | --- |
| close | Pack threads onto cores within a single socket before using another [7] | Applications with high data sharing | Threads frequently access the same cache lines |
| spread | Distribute threads as widely as possible across all sockets and cores [7] | Applications with independent threads | Maximize aggregate memory bandwidth |
| master | Threads are colocated near the master thread | NUMA-aware applications | Minimize remote memory access |
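
As a usage sketch (the specific values are illustrative), the affinity controls are set in the environment before launching the program:

export OMP_NUM_THREADS=16
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=close    # pack the team onto one socket first
./matrix_mul_omp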

Once threads are executing on cores, their performance is determined by the complex behavior of the CPU’s memory subsystem.

The Memory Abstraction Gap: The programmer’s view of memory as a simple, flat, shared space is a useful abstraction, but it is built upon a complex physical reality involving caches, coherence protocols, and non-uniform memory access.

Cache Coherence

Each CPU core has private L1 and L2 caches to minimize memory access latency.[8] When multiple threads on different cores compute rows of matrix C, they all write to a shared data structure, introducing a cache coherence problem.

The Coherence Problem: A core's private cache may hold a "stale" copy of a memory location that has been updated by another core. Multi-core CPUs implement hardware cache coherence protocols (typically MESI: Modified, Exclusive, Shared, Invalid) to maintain consistency.[8][9]

MESI Protocol Operation:

  1. Caches “snoop” on a shared bus
  2. When one core writes to a cache line, it broadcasts an invalidation signal
  3. Other cores with that line mark their copies as invalid
  4. Next access by another core causes cache miss → fetch updated version

False Sharing: A performance hazard in which threads write to nearby memory locations on the same cache line, causing excessive invalidation traffic despite no logical data sharing.[8]
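
A minimal sketch of the hazard, assuming a 64-byte cache line (typical, but hardware-dependent): each thread below updates only its own counter, yet without padding all eight counters share one cache line and every increment forces invalidation traffic between cores.

C++

#include <omp.h>

// Unpadded: the eight counters likely occupy a single 64-byte cache line,
// so logically independent updates still ping-pong the line between cores.
long counters_bad[8];

// Padded: each counter gets its own cache line, eliminating false sharing.
struct alignas(64) PaddedCounter { long value; };
PaddedCounter counters_good[8];

void accumulate(long iterations) {
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < iterations; i++) {
            counters_bad[tid]++;         // heavy coherence traffic
            counters_good[tid].value++;  // line stays resident in this core's cache
        }
    }
}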

Non-Uniform Memory Access (NUMA)

In multi-socket systems, shared memory is physically distributed.

| Memory Type | Access Latency | Scenario |
| --- | --- | --- |
| Local Memory | Low (fast) | Core accesses a memory bank on its own socket [10] |
| Remote Memory | High (slow) | Core accesses a memory bank on a different socket [10] |

NUMA Impact: If a thread on Socket 0 computes a row of C whose data is allocated in Socket 1's memory, every access incurs the higher latency of traversing the inter-socket interconnect. NUMA-aware scheduling (e.g., OMP_PROC_BIND=close) co-locates threads and data on the same NUMA node, minimizing remote access penalties.[10][11]
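
A common NUMA-aware idiom is first-touch initialization: on systems where the OS allocates a page on the node of the thread that first writes it (the default on Linux, assumed here), initializing the matrices with the same loop partitioning as the compute loop keeps each thread's rows in local memory. A brief sketch, reusing the declarations from the OpenMP example above:

C++

// First-touch initialization: each thread writes the rows it will later
// compute, so those pages are placed on that thread's NUMA node
// (assuming the default first-touch page placement policy).
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = 0.0;  // replace with the real input values
        B[i][j] = 0.0;
        C[i][j] = 0.0;
    }
}
// The multiplication loop then uses the same static schedule, so the
// thread-to-row assignment (and hence data locality) matches.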

9.3.3 Mapping to a Many-Core GPU with CUDA

Mapping the same matrix multiplication problem to a GPU demonstrates a completely different architectural philosophy and, as a result, a different parallelization strategy.

A CUDA implementation includes two parts: host code that runs on the CPU and a kernel that runs on the GPU. The host is responsible for allocating memory on both the host and the device, transferring the input matrices A and B to the GPU, launching the kernel, and transferring the resulting matrix C back from the GPU.[12]

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 1024;
    size_t size = N * N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("Result: C[0][0] = %.2f\n", h_C[0]);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

Compilation:

nvcc -o matrix_mul matrix_mul.cu
./matrix_mul

The CUDA mapping is hierarchical and explicitly defined by the programmer, and it translates directly to GPU hardware structures [13]:

| Hierarchy Level | Logical Construct | Physical Hardware | Mapping Details |
| --- | --- | --- | --- |
| 1. Grid | Grid of Thread Blocks | Entire GPU | Programmer organizes the computation into a 2D/3D grid; each thread computes one output element [14] |
| 2. Block | Thread Block | Streaming Multiprocessor (SM) | CUDA runtime distributes blocks to SMs; multiple blocks can reside on one SM [12] |
| 3. Warp | Group of 32 Threads | Warp Scheduler + CUDA Cores | Hardware groups threads into warps; the SM's Warp Scheduler issues instructions [12] |

Single Instruction, Multiple Thread (SIMT): All 32 threads in a warp execute the same instruction simultaneously but on different data. If a warp waits for memory, the scheduler instantly switches to another resident warp, hiding latency and keeping the computational units busy.[12]

CUDA Execution Hierarchy:

Grid (entire GPU)
└── Thread Blocks (mapped to SMs)
    └── Warps (32 threads, scheduled by SM)
        └── Individual Threads (execute on CUDA cores)
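
Applied to the launch configuration from the example above (N = 1024 with 16 × 16 thread blocks), the hierarchy works out to:

$$\text{blocks} = \left(\tfrac{1024}{16}\right)^2 = 64 \times 64 = 4096, \qquad \text{threads} = 4096 \times 256 = 1{,}048{,}576, \qquad \text{warps} = \tfrac{1{,}048{,}576}{32} = 32{,}768$$

Every element of the 1024 × 1024 output matrix gets its own thread, and the SMs have tens of thousands of warps from which the schedulers can choose to hide memory latency.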

Key GPU Architecture Concepts:

| Component | Role | Count/Capability |
| --- | --- | --- |
| Streaming Multiprocessor (SM) | Core processing unit of the GPU [12] | Multiple per GPU; runs multiple blocks concurrently |
| Warp | Hardware thread grouping [12] | 32 threads; execute in lockstep (SIMT) |
| Warp Scheduler | Selects warps for execution [12] | Hides memory latency via rapid context switching |
| CUDA Core | Arithmetic logic unit | Hundreds per SM; execute thread instructions |

The GPU’s performance is closely tied to its explicit and non-uniform memory hierarchy.

CPU vs. GPU Memory Philosophy: The CPU attempts to hide memory complexity through automatic caching. The GPU exposes its memory hierarchy to the programmer, who must manage it explicitly to achieve high performance.[15][12]

GPU Memory Hierarchy:

| Memory Type | Scope | Speed | Size | Programmer Control | Typical Use |
| --- | --- | --- | --- | --- | --- |
| Registers | Per-thread private | Fastest | Very small | Automatic (compiler-managed) | Loop variables, temporaries |
| Shared Memory | Per-block shared | Very fast (on-chip) | Small (KB per SM) | Explicit | Cooperative data sharing; programmer-managed cache [12] |
| Global Memory | All threads | Slow (high latency) | Large (GB of DRAM) [15] | Explicit | Input/output data; main data storage |
| Constant Memory | Read-only, all threads | Fast (cached) | Small | Explicit | Read-only lookup tables |
| Texture Memory | Read-only, all threads | Fast (cached, optimized for 2D locality) | Large | Explicit | Image processing |

The Primary GPU Bottleneck: Global memory access. The naive kernel performs numerous accesses to slow global memory, making it highly inefficient. The solution is tiling.[14]

Optimization via Tiling:

  1. Problem: Each thread performs many slow global memory accesses
  2. Solution: Thread block cooperatively loads sub-matrices (tiles) of A and B from global memory into fast shared memory
  3. Benefit: Threads compute using data exclusively from shared memory
  4. Result: Dramatically reduced global memory traffic; maximized data reuse in fast memory [14]

Effective CUDA Programming: Structure algorithms to maximize computation on data in registers and shared memory while minimizing global memory access. This optimization strategy defines the difference between naive and high-performance GPU code.[14]
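
The sketch below shows one common way to apply this tiling strategy to the earlier kernel, using a 16 × 16 tile that matches the block size; it is an illustrative implementation, not code taken from the sources.

CUDA

#define TILE 16

// Tiled matrix multiplication: each block cooperatively stages TILE x TILE
// sub-matrices of A and B in shared memory, so each global-memory element
// is loaded once per tile rather than once per thread that needs it.
__global__ void matrixMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Cooperative load: every thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // the tile must be fully loaded before anyone computes

        // Partial dot product entirely out of fast shared memory.
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // finish using the tile before it is overwritten
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

It can be launched with the same numBlocks and threadsPerBlock configuration as the naive kernel; only the way data is staged through shared memory changes.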

9.3.4 Synthesis: Contrasting CPU and GPU Mapping

The parallelization of matrix multiplication on a CPU versus a GPU highlights fundamental differences in their architectural philosophies.

Architectural Optimization Goals: The CPU is optimized for low-latency execution of a few complex tasks. The GPU is optimized for high-throughput execution of many simple tasks. This fundamental difference drives entirely different mapping strategies, memory considerations, and optimization goals.

Key Architectural Contrasts:

| Dimension | Multi-Core CPU | Many-Core GPU |
| --- | --- | --- |
| Unit of Parallelism | Heavyweight thread (OS/runtime-managed); small number mapped to powerful cores | Lightweight thread (hardware-managed in warps); thousands of threads |
| Scheduling Approach | Complex software-based (OS + runtime); sophisticated preemption and work-stealing for fairness and load balancing | Simple hardware-based warp scheduler; rapid context switching to hide memory latency |
| Memory Philosophy | Hardware complexity (large caches, coherence protocols) provides the illusion of uniform, coherent shared memory | Exposes the physical memory hierarchy; requires explicit programmer management for performance |
| Optimization Focus | Maximize cache hits; ensure NUMA-local access; minimize synchronization overhead | Maximize shared memory use; minimize global memory access; maximize warp occupancy |
| Parallelism Granularity | Coarse (rows or chunks of the matrix) | Fine (individual matrix elements) |

The Architecture-Algorithm Dependency: An algorithm optimized for one architecture can be highly inefficient on another. The optimal parallelization method is not universal but architecture-specific—it answers: “What is the most effective way to map this computation onto the resources and constraints of this particular hardware?”

These contrasts demonstrate why portable performance across diverse hardware remains a fundamental challenge in parallel computing. The same algorithm requires radically different implementations to achieve efficiency on different architectures.

| Feature | Multi-Core CPU (with OpenMP) | Many-Core GPU (with CUDA) |
| --- | --- | --- |
| Programming Construct | #pragma omp parallel for | __global__ kernel launch <<<...>>> |
| Logical Parallel Unit | Loop iteration (e.g., a matrix row) | Single data element (e.g., a matrix cell) |
| Physical Execution Unit | OS thread mapped to a CPU core | CUDA thread, executed in a warp of 32 |
| Scheduling | OS/runtime scheduler (preemptive, work-stealing possible) | Hardware warp scheduler (non-preemptive, latency hiding) |
| Key Memory Challenge | Cache coherence (MESI), NUMA locality | Global memory latency |
| Optimization Goal | Maximize cache hits, ensure NUMA-local access | Maximize use of shared memory, minimize global memory access |

References

1. Parallel Programming With OpenMP In C++ - Matrix Multiplication …, accessed October 7, 2025, https://www.c-sharpcorner.com/article/parallel-programming-with-openmp-in-cpp-matrix-multiplication-example/
2. 5.2 Matrix Multiplication — Parallel Computing for Beginners - Free Hands-On Materials for Learning PDC, accessed October 7, 2025, https://www.learnpdc.org/PDCBeginners/5-applications/matrix-multiply.html
3. Parallelized Blocked Matrix Multiplication using OpenMP | by Charith Pietersz - Medium, accessed October 7, 2025, https://medium.com/@cj.ptsz/parallelized-blocked-matrix-multiplication-using-openmp-97a4bc620a47
4. Pragma directives for OpenMP parallelization - IBM, accessed October 7, 2025, https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=reference-pragma-directives-openmp-parallelization
5. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html
6. OpenMP thread mapping to physical cores - Stack Overflow, accessed October 7, 2025, https://stackoverflow.com/questions/4717251/openmp-thread-mapping-to-physical-cores
7. Processor Affinity for OpenMP and MPI » ADMIN Magazine, accessed October 7, 2025, https://www.admin-magazine.com/HPC/Articles/Processor-Affinity-for-OpenMP-and-MPI
8. Multicore processors and cache coherence | Intro to Computer Architecture Class Notes, accessed October 7, 2025, https://fiveable.me/introduction-computer-architecture/unit-7/multicore-processors-cache-coherence/study-guide/0Fr1h83bSS0NcUMH
9. Cache Coherence: How the MESI Protocol Keeps Multi-Core CPUs …, accessed October 7, 2025, https://dev.to/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j
10. Non-uniform memory access - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Non-uniform_memory_access
11. Improving Parallel System Performance with a NUMA-aware Load Balancer - IDEALS, accessed October 7, 2025, https://www.ideals.illinois.edu/items/26080/bitstreams/89289/object?dl=1
12. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8
13. CUDA C++ Programming Guide - NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
14. Matrix Multiplication in CUDA - Harshit Kumar, accessed October 7, 2025, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda
15. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf