
9.3 Mapping Constructs to Hardware: An Example

The theoretical concepts of programming models, compilers, and runtimes become tangible when we trace the path of a specific algorithm from source code to execution on physical hardware. By examining how a canonical parallel problem—matrix multiplication—is implemented on two distinct architectures, a multi-core CPU and a many-core GPU, we can illuminate the profound influence of hardware design on software strategy. The most efficient parallelization approach is not an intrinsic property of the algorithm but is instead dictated by the architectural realities of the target machine.

9.3.1 The Canonical Problem: Matrix Multiplication


The multiplication of two matrices, $C = A \times B$, is a cornerstone of linear algebra and a fundamental operation in countless scientific and engineering domains [1]. The computation of each element $C_{i,j}$ involves the dot product of the i-th row of matrix A and the j-th column of matrix B [2]:

$$
C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}
$$

This algorithm has a computational complexity of $O(N^3)$ for $N \times N$ matrices, making it computationally intensive and a prime candidate for parallelization [3]. Critically, the calculation for each element $C_{i,j}$ is completely independent of the calculation for any other element. This property makes matrix multiplication an “embarrassingly parallel” problem, providing a clear and ideal test case for demonstrating parallel implementation techniques [2].
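
To make the scale concrete, a brief back-of-the-envelope count helps (the value N = 1024 below is chosen purely for illustration and does not come from the sources):

$$
2N^3 = 2 \times 1024^3 \approx 2.1 \times 10^9 \ \text{floating-point operations} \quad (N^3 \ \text{multiplications and } N^3 \ \text{additions}),
$$

while the $N^2 \approx 10^6$ output elements can all be computed independently of one another.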

9.3.2 Mapping to a Multi-Core CPU with OpenMP


For a typical shared-memory multi-core CPU, OpenMP provides a straightforward and efficient path to parallelization.

A standard C++ implementation of matrix multiplication involves three nested loops. Parallelizing this with OpenMP is remarkably simple: a single directive, #pragma omp parallel for, is placed before the outermost loop. This instructs the compiler to divide the iterations of the i loop, which correspond to the rows of the output matrix C, among a team of parallel threads [1].

C++

// C, A, and B are N x N matrices
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++) {
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}

The journey from this pragma to hardware execution involves a precise collaboration between the compiler, the OpenMP runtime, and the operating system:

  1. Compiler Transformation: Upon seeing the #pragma omp parallel for, the compiler transforms the code. It generates a function containing the loop body and inserts calls to the OpenMP runtime library to manage the parallel region [4].
  2. Runtime Thread Creation: When the program is executed, the OpenMP runtime creates a team of software threads. By default, this is often one thread per available CPU core [5].
  3. Thread-to-Core Mapping (Affinity): The operating system scheduler, in conjunction with the OpenMP runtime, maps these software threads onto the physical CPU cores. This mapping, or thread affinity, is critical for performance. Programmers can exert control over this process using environment variables [6]. OMP_PLACES defines the set of available hardware resources (e.g., cores, sockets), while OMP_PROC_BIND specifies the placement policy. For example, OMP_PROC_BIND=close will pack threads onto cores within a single socket before using another, which can be beneficial for applications with high data sharing. Conversely, OMP_PROC_BIND=spread will distribute threads as widely as possible across all sockets and cores, which can maximize memory bandwidth for applications where threads are more independent [7]. The short sketch after this list shows how a program can report where its threads actually land.
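
As a way to observe these placement policies, the following minimal sketch (not part of the original example; the reporting logic is illustrative) has every OpenMP thread print the place and hardware CPU it is running on. sched_getcpu() is a Linux/glibc call used here purely for demonstration.

C++

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu() (Linux/glibc)

int main() {
    #pragma omp parallel
    {
        int tid   = omp_get_thread_num();
        int place = omp_get_place_num();   // index into OMP_PLACES (OpenMP 4.5+)
        int cpu   = sched_getcpu();        // hardware CPU currently executing this thread
        #pragma omp critical
        std::printf("thread %d -> place %d, cpu %d\n", tid, place, cpu);
    }
    return 0;
}

Running the same binary with OMP_PROC_BIND=close and then OMP_PROC_BIND=spread (with OMP_PLACES=cores) makes the packed versus distributed placements directly visible.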

Once threads are running on cores, their performance is governed by the intricate behavior of the CPU’s memory subsystem. The programmer’s view of memory as a simple, flat, shared space is a powerful abstraction, but it is layered upon a complex physical reality.

  • Cache Coherence: Each CPU core has its own private L1 and L2 caches to reduce memory access latency [8]. When multiple threads on different cores compute rows of the matrix C, they are all writing to a shared data structure. This creates a cache coherence problem: a core’s private cache might hold a “stale” copy of a memory location that has been updated by another core [9]. To solve this, multi-core CPUs implement a hardware cache coherence protocol, most commonly MESI (Modified, Exclusive, Shared, Invalid) or a variant [8]. In this protocol, caches “snoop” on a shared bus. When one core writes to a cache line, it broadcasts an invalidation signal, forcing other cores that have a copy of that line to mark it as invalid. The next time another core needs that data, it will experience a cache miss and fetch the updated version from memory or directly from the modifying core’s cache [9]. This hardware mechanism is what makes the shared-memory illusion work, but it comes with communication overhead that can impact performance, especially if threads write to nearby memory locations that fall on the same cache line, a phenomenon known as false sharing [8]. The short sketch after this list illustrates the effect.
  • Non-Uniform Memory Access (NUMA): In systems with multiple processor sockets, the shared memory is physically distributed. A CPU core can access the memory bank attached to its own socket (local memory) much faster than it can access memory attached to another socket (remote memory) [10]. This creates a Non-Uniform Memory Access architecture. If a thread running on a core in Socket 0 is assigned to compute a row of C whose data happens to have been allocated in the memory of Socket 1, every memory access will incur the higher latency of traversing the inter-socket interconnect [10]. NUMA-aware scheduling, managed by the OS and guided by OpenMP affinity settings, is therefore crucial for performance. By ensuring that threads and the data they operate on are co-located on the same NUMA node (e.g., using OMP_PROC_BIND=close), these remote access penalties can be minimized, allowing the application to scale effectively [11].
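
To make false sharing concrete, the sketch below (not drawn from the sources; it assumes a 64-byte cache line, eight threads, and uses volatile so the stores stay in the loop) times the same per-thread increments with the counters packed together and then padded to separate cache lines. On a typical multi-core CPU the packed version is markedly slower, even though the arithmetic is identical.

C++

#include <cstdio>
#include <omp.h>

constexpr int  kNumThreads = 8;            // illustrative thread count
constexpr long kIterations = 100000000;    // enough work to make the difference visible

// Counters packed together: several of them share one 64-byte cache line, so a write
// by one core invalidates the line in every other core's private cache.
volatile long packed[kNumThreads];

// Each counter aligned to its own (assumed) 64-byte cache line, so each core keeps
// its line in the Modified state without interference.
struct alignas(64) PaddedCounter { volatile long value; };
PaddedCounter padded[kNumThreads];

int main() {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(kNumThreads)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < kIterations; i++) packed[tid] = packed[tid] + 1;              // false sharing
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(kNumThreads)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < kIterations; i++) padded[tid].value = padded[tid].value + 1;  // no false sharing
    }
    double t2 = omp_get_wtime();
    std::printf("packed: %.3fs  padded: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}

The only difference between the two timed regions is memory layout; the coherence traffic generated by the packed layout accounts for the gap.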

9.3.3 Mapping to a Many-Core GPU with CUDA

Mapping the same matrix multiplication problem to a GPU reveals a completely different architectural philosophy and, consequently, a different parallelization strategy.

A CUDA implementation consists of two parts: host code running on the CPU and a kernel running on the GPU. The host is responsible for allocating memory on both the host and device, transferring the input matrices A and B to the GPU, launching the kernel, and transferring the result matrix C back from the GPU [12].

C++

// Host Code Snippet
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
dim3 threadsPerBlock(16, 16);
// Round up so the grid covers all of C even when N is not a multiple of 16
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Device (Kernel) Code
__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
    // Each thread computes exactly one element of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
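
One practical detail the snippet above leaves out: a kernel launch returns no error status of its own, and each CUDA API call returns a cudaError_t that the example ignores. A hedged continuation of the host code (illustrative, not from the sources; assumes <cstdio> is included) checks the launch and waits for the kernel to complete:

C++

// Check for launch-configuration errors, then wait for the kernel to finish.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
}
err = cudaDeviceSynchronize();   // also surfaces errors raised during kernel execution
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
}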

The CUDA mapping is hierarchical and explicitly defined by the programmer, translating directly to GPU hardware structures [13]:

  1. Grid, Blocks, and Threads: The programmer organizes the computation into a Grid of Thread Blocks. In this case, a 2D grid of 2D blocks is natural. Each thread is assigned the task of computing a single element of the output matrix C [14].
  2. Block-to-SM Mapping: The CUDA runtime is responsible for distributing the Thread Blocks from the grid onto the GPU’s Streaming Multiprocessors (SMs). An SM is the core processing unit of the GPU. Multiple blocks can be resident on a single SM at a time, and the runtime schedules them for execution as resources become available [12].
  3. Thread-to-Warp-to-Core Mapping: Within an SM, threads are grouped by the hardware into Warps (typically 32 threads) [12]. The SM’s Warp Scheduler is the key to its efficiency. It selects a warp whose threads are ready to execute and issues their next instruction to the SM’s CUDA cores. The execution model is Single Instruction, Multiple Thread (SIMT): all 32 threads in a warp execute the same instruction at the same time, but on different data [12]. If a warp must wait for a slow memory access, the scheduler can instantly switch to another resident warp, hiding the memory latency and keeping the computational units busy. The short sketch after this list shows how a thread’s position within its block determines the warp it belongs to.
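
To make the thread-to-warp grouping tangible, the kernel below is an illustrative sketch (not from the sources; it assumes the 32-thread warp size mentioned above). It linearizes each thread's 2D index within its block, which is the order the hardware uses when forming warps, and has the first thread of each warp report where that warp begins.

C++

#include <cstdio>

__global__ void whichWarp() {
    // Linearize the 2D thread index within the block (x varies fastest).
    int linear = threadIdx.y * blockDim.x + threadIdx.x;
    int warpId = linear / 32;   // 32 threads per warp, as noted in the text
    int lane   = linear % 32;   // this thread's position within its warp
    if (lane == 0) {
        printf("block (%d,%d): warp %d starts at linear thread %d\n",
               blockIdx.x, blockIdx.y, warpId, linear);
    }
}

int main() {
    dim3 threadsPerBlock(16, 16);    // same block shape as the matrix kernel: 256 threads
    whichWarp<<<dim3(1, 1), threadsPerBlock>>>();
    cudaDeviceSynchronize();         // flush device-side printf output
    return 0;
}

For the 16 x 16 blocks used above, each block holds 256 threads, i.e., eight warps.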

The GPU’s performance is inextricably linked to its explicit and non-uniform memory hierarchy. Unlike the CPU, which tries to hide this complexity, the GPU exposes it to the programmer, who must manage it skillfully to achieve high performance.

  • The GPU Memory Hierarchy: A thread has access to several distinct memory spaces [12]. It has its own private, ultra-fast registers. Threads within the same block can communicate and share data through a fast, on-chip shared memory, which acts as a programmer-managed cache. All threads across all blocks can access the large but high-latency global memory (the device’s DRAM) [15].
  • Optimization via Tiling: The naive kernel shown above is highly inefficient because each thread makes many accesses to slow global memory. The standard high-performance solution is a tiled or blocked algorithm [14]. In this approach, a thread block cooperatively loads small sub-matrices (tiles) of A and B from global memory into the fast shared memory. Then, all threads in the block perform the necessary multiplications for their output element using data exclusively from shared memory. This strategy drastically reduces global memory traffic, the primary performance bottleneck, by maximizing data reuse within the SM’s fastest memory space [14]. Effective CUDA programming is, in large part, the art of structuring algorithms to maximize computation on data held in registers and shared memory while minimizing global memory access. A sketch of such a tiled kernel follows this list.
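
The kernel below is a minimal sketch of this tiling idea (a standard formulation rather than code from the sources; it assumes 16 x 16 tiles and the same row-major layout as the naive kernel). Each thread block stages one tile of A and one tile of B in shared memory per step, synchronizes, and accumulates partial dot products entirely out of shared memory.

C++

#define TILE 16   // tile width, matching the 16 x 16 thread blocks used earlier

__global__ void matrixMulTiled(const float* A, const float* B, float* C, int N) {
    // Shared-memory staging buffers for one tile of A and one tile of B.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // Each thread loads one element of each tile (zero past the matrix edge).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();   // make sure the tiles are fully loaded before anyone reads them

        // Partial dot product using only shared memory.
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();   // finish reading before the next iteration overwrites the tiles
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

It can be launched with the same grid and block configuration as the naive kernel. Relative to that kernel, each thread now performs roughly N / TILE global loads per input matrix instead of N, cutting global memory traffic by about a factor of TILE.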

9.3.4 Synthesis: Contrasting CPU and GPU Mapping


The parallelization of matrix multiplication on a CPU versus a GPU reveals fundamental differences in their architectural philosophies. The CPU is optimized for low latency on a few complex tasks, while the GPU is optimized for high throughput on many simple tasks. This leads to entirely different mapping strategies, memory considerations, and optimization goals.

  • Unit of Parallelism: On the CPU, the primary unit of parallelism is a heavyweight thread, managed by the OS and runtime. A small number of these threads are mapped to powerful cores. On the GPU, parallelism comes from a massive number of lightweight threads, which are grouped and managed in hardware as warps.
  • Scheduling: The CPU relies on a complex, software-based scheduler (in the OS and runtime) that can perform sophisticated operations like preemption and work-stealing to ensure fairness and balance load. The GPU uses a simpler, hardware-based warp scheduler whose primary goal is to hide memory latency by rapidly switching between a large pool of available warps.
  • Memory Philosophy: The CPU architecture invests significant hardware complexity (large caches, sophisticated coherence protocols) to present the programmer with the simple illusion of a uniform, coherent shared memory space. The GPU architecture largely abandons this illusion, exposing its physical memory hierarchy to the programmer and demanding explicit data management to achieve performance.

These profound differences underscore the insight that an algorithm optimized for one architecture can be deeply inefficient on another. The “best” way to parallelize a problem is not a universal truth but a specific answer to the question: “What is the most effective way to map this computation onto the resources and constraints of this particular hardware?”

Feature | Multi-Core CPU (with OpenMP) | Many-Core GPU (with CUDA)
--- | --- | ---
Programming Construct | #pragma omp parallel for | __global__ kernel launch <<<…>>>
Logical Parallel Unit | Loop Iteration (e.g., a matrix row) | Single Data Element (e.g., a matrix cell)
Physical Execution Unit | OS Thread mapped to a CPU Core | CUDA Thread, executed in a Warp of 32
Scheduling | OS/Runtime scheduler (preemptive, work-stealing possible) | Hardware Warp Scheduler (non-preemptive, latency hiding)
Key Memory Challenge | Cache Coherence (MESI), NUMA locality | Global Memory Latency
Optimization Goal | Maximize cache hits, ensure NUMA-local access | Maximize use of Shared Memory, minimize global memory access

  1. Parallel Programming With OpenMP In C++ - Matrix Multiplication …, accessed October 7, 2025, https://www.c-sharpcorner.com/article/parallel-programming-with-openmp-in-cpp-matrix-multiplication-example/

  2. 5.2 Matrix Multiplication — Parallel Computing for Beginners - Free Hands-On Materials for Learning PDC, accessed October 7, 2025, https://www.learnpdc.org/PDCBeginners/5-applications/matrix-multiply.html

  3. Parallelized Blocked Matrix Multiplication using OpenMP | by Charith Pietersz - Medium, accessed October 7, 2025, https://medium.com/@cj.ptsz/parallelized-blocked-matrix-multiplication-using-openmp-97a4bc620a47

  4. Pragma directives for OpenMP parallelization - IBM, accessed October 7, 2025, https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.1?topic=reference-pragma-directives-openmp-parallelization

  5. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html

  6. OpenMP thread mapping to physical cores - Stack Overflow, accessed October 7, 2025, https://stackoverflow.com/questions/4717251/openmp-thread-mapping-to-physical-cores

  7. Processor Affinity for OpenMP and MPI » ADMIN Magazine, accessed October 7, 2025, https://www.admin-magazine.com/HPC/Articles/Processor-Affinity-for-OpenMP-and-MPI

  8. Multicore processors and cache coherence | Intro to Computer Architecture Class Notes, accessed October 7, 2025, https://fiveable.me/introduction-computer-architecture/unit-7/multicore-processors-cache-coherence/study-guide/0Fr1h83bSS0NcUMH

  9. Cache Coherence: How the MESI Protocol Keeps Multi-Core CPUs …, accessed October 7, 2025, https://dev.to/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j

  10. Non-uniform memory access - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Non-uniform_memory_access

  11. Improving Parallel System Performance with a NUMA-aware Load Balancer - IDEALS, accessed October 7, 2025, https://www.ideals.illinois.edu/items/26080/bitstreams/89289/object?dl=1

  12. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8

  13. CUDA C++ Programming Guide - NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation

  14. Matrix Multiplication in CUDA - Harshit Kumar, accessed October 7, 2025, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda

  15. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf