7.1 The GPGPU Programming Model

To harness the immense power of a GPU effectively, one must understand its unique architecture and the programming model that exposes it. This model differs fundamentally from that of a CPU, reflecting a deep architectural divergence: the GPU is optimized for throughput, the CPU for latency. Understanding this distinction, along with the concepts of SIMT execution, the hierarchical organization of threads, and the critical importance of the memory hierarchy, is essential for developing efficient GPGPU applications.

A Tale of Two Processors: Latency vs. Throughput


The architectural philosophies of CPUs and GPUs are tailored to opposite ends of the computational spectrum. This specialization is the primary reason for the performance disparity in parallel workloads.

  • CPU: Optimized for Latency: A CPU is designed to execute a single stream of instructions (a thread) with the lowest possible delay, or latency.1 To achieve this, it employs a small number of powerful, complex cores that run at very high clock speeds. A large portion of the CPU’s silicon die is dedicated to sophisticated control logic, such as branch prediction and out-of-order execution, and to multiple levels of large cache memory.1 These features work in concert to ensure that a single thread can access data and execute instructions as quickly as possible. By analogy, the CPU is like a highly skilled head chef in a restaurant, capable of performing a series of complex, sequential tasks very rapidly.2
  • GPU: Optimized for Throughput: A GPU, in contrast, is designed to execute thousands of parallel threads simultaneously, maximizing the total amount of work completed per unit of time, or throughput.1 It achieves this by dedicating the vast majority of its silicon to a massive number of simpler, slower arithmetic logic units (ALUs), grouped into cores.3 It largely forgoes the complex per-core control logic and large caches of a CPU. Instead of trying to minimize the latency of any single thread, the GPU’s strategy is to hide latency with computation: when one group of threads stalls waiting for data from memory, the GPU’s scheduler simply switches to another group that is ready to execute, keeping the computational units constantly busy.1 Continuing the analogy, the GPU is like a large team of junior assistants, each performing a simple task (like flipping a burger) in parallel, so that hundreds of burgers are cooked simultaneously.2

The Heart of the Machine: Single Instruction, Multiple Threads (SIMT)


The core execution model that enables this massive parallelism is known as Single Instruction, Multiple Threads (SIMT).4 SIMT is a powerful abstraction that combines the efficiency of a SIMD (Single Instruction, Multiple Data) architecture with a more intuitive programming model. From the programmer’s perspective, SIMT presents a model where thousands of independent threads are launched, each with its own program counter and state, executing the same kernel code.5 This allows the developer to write code for a single, scalar thread without needing to manually bundle data into vectors, a requirement of traditional SIMD programming.5

Under the hood, the hardware groups these threads into fixed-size blocks for execution. In NVIDIA’s architecture, this group is called a warp, typically consisting of 32 threads.6 In AMD’s terminology, it is a wavefront, which historically contained 64 threads but now often uses 32 as well.4 All threads within a single warp execute the same instruction in lock-step on different data. This hardware-level grouping allows the GPU to achieve high efficiency by using a single instruction fetch and decode unit for all 32 threads, dedicating more silicon to the ALUs themselves.4 The term “warp” was aptly borrowed from the world of weaving, one of the earliest parallel-thread technologies.6

A critical performance consideration in the SIMT model is branch divergence. Since all threads in a warp share a single program counter, they must execute the same instruction at any given time. If the code contains a conditional statement (e.g., an if-else block) where different threads in the warp need to take different paths, the hardware must serialize the execution. First, the threads for which the condition is true execute the if block, while the other threads are temporarily deactivated or “masked.” After the if block is complete, the situation reverses: the threads that took the first path are masked, and the threads for which the condition was false execute the else block.6 This serialization means that for divergent branches, some of the hardware’s computational resources sit idle, leading to a significant loss of performance. Consequently, algorithms that minimize branch divergence within a warp are heavily favored in GPGPU programming.4
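To make branch divergence concrete, the following CUDA sketch shows a kernel whose branch splits the even and odd lanes of each warp, forcing the two paths to be serialized, alongside a variant that branches on a warp-aligned quantity so that all 32 lanes of a warp take the same path. The kernel names and the even/odd split are illustrative assumptions, not drawn from the text above; this is a minimal sketch assuming a warp size of 32.

```cuda
// Hypothetical kernel illustrating intra-warp branch divergence.
__global__ void divergent_scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Even and odd lanes of the same warp take different paths: the two
    // branches execute one after the other, with inactive lanes masked.
    if (threadIdx.x % 2 == 0) {
        data[i] = data[i] * 2.0f;   // even lanes
    } else {
        data[i] = data[i] + 1.0f;   // odd lanes
    }
}

// A divergence-free variant branches on a warp-aligned value instead,
// so every lane of a given warp follows the same path (warp size assumed 32).
__global__ void uniform_scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((threadIdx.x / 32) % 2 == 0) {  // identical for all 32 lanes of a warp
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}
```

Both kernels compute the same per-element results; the second simply arranges the branch so that divergence never occurs within a warp, which is the kind of restructuring the SIMT model rewards.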

Organizing the Parallel Army: Grids, Blocks, and Threads


To manage the complexity of launching and coordinating thousands or millions of threads, GPGPU programming models like CUDA and OpenCL employ a hierarchical structure. This abstraction provides a logical way for programmers to organize their parallel tasks and allows the hardware to schedule work efficiently and scale across different generations of GPUs. The hierarchy consists of three levels:

  1. Thread: The most basic unit of execution. A single thread executes an instance of the kernel function and is identified by a unique thread ID within its group.7
  2. Block (or Work-Group in OpenCL): Threads are grouped into blocks. A block is a collection of threads, arranged in one, two, or three dimensions, that can cooperate through fast, on-chip shared memory and can synchronize their execution.8 All threads in a block are guaranteed to execute on the same Streaming Multiprocessor (SM) of the GPU.7
  3. Grid: Blocks are organized into a one-, two-, or three-dimensional grid. The grid represents the entirety of the threads launched for a given kernel execution.7

This hierarchical structure is not just an organizational convenience; it is a direct mapping of the software model to the physical hardware. When a kernel is launched, the grid of blocks is distributed across the GPU’s available SMs. Each SM can be assigned one or more thread blocks to execute concurrently, depending on the resources (registers, shared memory) required by each block.7 The threads within each block are then executed by the SM’s cores in warps. A crucial feature of this model is that while threads within a block can communicate and synchronize, threads in different blocks cannot.7 This ensures that blocks are independent units of work that can be scheduled in any order across any available SM. This independence is the key to the model’s scalability: code written today will automatically scale to run on a future GPU with more SMs, as the runtime system will simply distribute the same grid of blocks across a larger number of processors.9
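A minimal CUDA sketch of this mapping is shown below: the host configures the grid and block dimensions at launch time, and each thread derives a unique global index from its block and thread IDs. The kernel name vector_add, the helper launch_vector_add, and the block size of 256 are illustrative assumptions, not taken from the text.

```cuda
#include <cuda_runtime.h>

// Each thread handles one element, identified by its position in the
// grid/block hierarchy.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

void launch_vector_add(const float* d_a, const float* d_b, float* d_c, int n) {
    int threads_per_block = 256;                              // block size
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
    // The grid of independent blocks is distributed across the GPU's SMs.
    vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();
}
```

Because each block is independent, the same launch scales transparently: a GPU with more SMs simply executes more of the blocks concurrently.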

The Memory Bottleneck: Mastering the GPU Memory Hierarchy


The extraordinary computational throughput of a GPU is only useful if its thousands of cores can be continuously supplied with data. Consequently, performance in GPGPU is often limited not by computation, but by memory access.10 A deep understanding and strategic use of the GPU’s complex, multi-layered memory hierarchy is arguably the most critical skill for a GPGPU programmer. The hierarchy represents a trade-off between speed, size, and scope, and mastering it is key to avoiding the “memory bottleneck.”11 The different levels of memory are designed to work in concert with the thread hierarchy. Some memory is private to a single thread, some is shared by a block, and some is accessible to the entire grid. The performance penalty for using slower, larger memory spaces instead of faster, smaller ones can be orders of magnitude, which powerfully incentivizes programmers to design algorithms with data locality in mind.

| Memory Type | Location | Scope | Access Speed | Typical Capacity | Primary Use Case |
| --- | --- | --- | --- | --- | --- |
| Registers | On-chip (in SM) | Per-thread | Fastest (~1 cycle) | Kilobytes per SM | Frequently accessed thread-private variables |
| Shared Memory | On-chip (in SM) | Per-block | Very fast (~10s of cycles) | Tens of kilobytes per SM | User-managed cache; inter-thread communication within a block |
| L1/L2 Cache | On-chip (L1 in SM; L2 shared) | Per-SM / per-device | Fast | KB (L1) / MB (L2) | Hardware-managed cache for global/local memory accesses |
| Global Memory | Off-chip (DRAM) | Per-grid (device-wide) | Slow (~100s of cycles) | Gigabytes | Main data storage for kernel input/output |
| Constant Memory | Off-chip (DRAM), cached | Per-grid (device-wide) | Fast (if cached) | Tens of kilobytes | Read-only data broadcast to all threads (e.g., coefficients) |
| Texture Memory | Off-chip (DRAM), cached | Per-grid (device-wide) | Fast (if cached) | Gigabytes | Read-only data with spatial-locality-optimized caching |
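As an illustration of shared memory used as a user-managed cache, here is a minimal CUDA sketch of a block-level sum reduction: each block stages its portion of the input in fast on-chip shared memory, synchronizes, and combines partial sums before writing one result per block to global memory. The kernel name block_sum and the fixed block size of 256 are illustrative assumptions.

```cuda
// Hypothetical block-wise sum reduction using shared memory as a
// user-managed cache (assumes blockDim.x == 256, a power of two).
__global__ void block_sum(const float* in, float* block_results, int n) {
    __shared__ float tile[256];               // on-chip, shared by the block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;       // stage data from global memory
    __syncthreads();                          // block-wide barrier

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        block_results[blockIdx.x] = tile[0];  // one partial sum per block
    }
}
```

Each element is read from slow global memory exactly once; all of the repeated accesses during the reduction hit fast shared memory, which is precisely the locality the hierarchy is designed to reward.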

A key performance concept related to global memory is coalesced memory access. When the 32 threads of a warp access contiguous locations in global memory, the hardware can combine their 32 small requests into one or a few wide memory transactions, dramatically increasing the effective memory bandwidth. Conversely, if threads access memory in a scattered, random pattern, the hardware must issue many separate, inefficient transactions, crippling performance. The architecture thus rewards algorithms that exhibit structured, predictable memory access patterns.
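The sketch below contrasts a coalesced access pattern with a strided one; the kernel names are illustrative assumptions. In the first kernel, consecutive threads read consecutive floats, so a warp's loads fall in one contiguous span; in the second, each thread jumps by a stride and the warp's loads scatter across many cache lines.

```cuda
// Coalesced: thread i touches element i, so a warp reads a contiguous
// 128-byte span of floats that can be served by a few wide transactions.
__global__ void copy_coalesced(const float* src, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Strided: thread i touches element i * stride, so the warp's accesses
// land on widely separated cache lines and effective bandwidth drops.
__global__ void copy_strided(const float* src, float* dst, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i * stride;
    if (j < n) dst[i] = src[j];
}
```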

Languages of the Machine: A Comparative Analysis of GPGPU Frameworks


Several programming frameworks have been developed to enable GPGPU, each with its own design philosophy, strengths, and weaknesses. The choice of framework is a critical decision that impacts performance, portability, and developer productivity.

  • NVIDIA CUDA: As the first dedicated GPGPU platform, CUDA enjoys the most mature and extensive ecosystem.12 It is a proprietary framework that works exclusively on NVIDIA GPUs.8 Its primary strengths are its tight integration with the hardware, which often yields the highest performance, and its vast collection of highly optimized libraries for specific domains, such as cuDNN for deep learning, cuBLAS for linear algebra, and the RAPIDS suite for data science. It also features superior developer tools, including advanced profilers and debuggers like Nsight.13 The principal drawback of CUDA is vendor lock-in; code written in CUDA cannot be run on hardware from other manufacturers like AMD or Intel.8
  • OpenCL (Open Computing Language): Developed as an open, royalty-free standard by the Khronos Group, OpenCL’s primary advantage is portability.12 A single OpenCL program can, in theory, run on a wide variety of hardware, including GPUs from NVIDIA, AMD, and Intel, as well as CPUs, FPGAs, and DSPs.8 However, this portability comes at a cost. The OpenCL standard often lags behind CUDA in adopting new hardware features, and vendor support can be inconsistent—NVIDIA, for example, has historically provided only limited support for newer versions of OpenCL.14 Achieving optimal performance often requires writing hardware-specific optimizations, which can undermine the “write once, run anywhere” promise.12 The API can also be more verbose and require more boilerplate code than CUDA.15
  • SYCL: Also a standard from the Khronos Group, SYCL is a higher-level programming model built on top of other backends, most commonly OpenCL.8 It allows developers to write single-source, modern C++ code for heterogeneous systems, abstracting away much of the boilerplate associated with OpenCL.8 Its goal is to combine the portability of OpenCL with a more user-friendly and integrated programming experience. As a newer standard, its ecosystem is less mature than CUDA’s, and the level of abstraction can sometimes make it more difficult for advanced developers to perform fine-grained, hardware-specific optimizations.16
  • DirectCompute: This is Microsoft’s GPGPU API, integrated into the DirectX suite of multimedia APIs.14 It is primarily used within the context of the Windows operating system and is particularly prevalent in the game development industry for tasks like physics simulations and post-processing effects. While powerful, it is less common in the scientific high-performance computing and AI domains, which are dominated by CUDA and, to a lesser extent, OpenCL.14

The following table provides a summary comparison of the major GPGPU frameworks.

| Feature | NVIDIA CUDA | OpenCL | SYCL |
| --- | --- | --- | --- |
| Governing Body | NVIDIA (proprietary) | Khronos Group (open standard) | Khronos Group (open standard) |
| Primary Language | C/C++ with extensions | C/C++-based kernel language | Modern C++ (single-source) |
| Hardware Support | NVIDIA GPUs only | CPUs, GPUs (NVIDIA, AMD, Intel), FPGAs, DSPs | CPUs, GPUs, FPGAs (via OpenCL or other backends) |
| Portability | Low (vendor-specific) | High (cross-vendor, cross-device) | High (built on OpenCL/other backends) |
| Ecosystem & Libraries | Extremely mature and extensive (cuDNN, cuBLAS, etc.) | Less extensive; vendor-specific libraries exist | Growing, but less mature than CUDA |
| Performance | Typically highest on NVIDIA hardware due to tight integration | Can be high, but may require vendor-specific tuning | Dependent on the underlying backend (e.g., OpenCL) |
| Ease of Use | High, with a well-documented, stable API | Moderate; more verbose and requires manual boilerplate | High; abstracts away boilerplate with modern C++ features |

  1. CUDA Refresher: Reviewing the Origins of GPU Computing | NVIDIA Technical Blog, accessed October 6, 2025, https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/

  2. GPU vs CPU - Difference Between Processing Units - AWS, accessed October 6, 2025, https://aws.amazon.com/compare/the-difference-between-gpus-cpus/

  3. Runtime Comparison of CPU and GPU Using Portable … - SciSpace, accessed October 6, 2025, https://scispace.com/pdf/runtime-comparison-of-cpu-and-gpu-using-portable-programming-2hzy2njaya.pdf

  4. Single instruction, multiple threads - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads

  5. SIMT vs SIMD: Parallelism in Modern Processors - Benjamin H Glick, accessed October 6, 2025, https://www.glick.cloud/blog/simt-vs-simd-parallelism-in-modern-processors

  6. Cornell Virtual Workshop > Understanding GPU Architecture > GPU …, accessed October 6, 2025, https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/simt_warp

  7. Thread block (CUDA programming) - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)

  8. Comparing SYCL, OpenCL, and CUDA: Matrix Multiplication …, accessed October 6, 2025, https://sgurwinderr.github.io/blog/sycl-opencl-cuda/

  9. CUDA programming model of threads, blocks, and grids, with… - ResearchGate, accessed October 6, 2025, https://www.researchgate.net/figure/CUDA-programming-model-of-threads-blocks-and-grids-with-corresponding-per-thread_fig3_224194485

  10. Dissecting GPU Memory Hierarchy through Microbenchmarking - arXiv, accessed October 6, 2025, https://arxiv.org/pdf/1509.02308

  11. Memory Hierarchy of GPUs - Arc Compute, accessed October 6, 2025, https://www.arccompute.io/arc-blog/gpu-101-memory-hierarchy

  12. Cuda OpenCL comparison cuda, openCL, nvidia - CUDA Programming and Performance, accessed October 6, 2025, https://forums.developer.nvidia.com/t/cuda-opencl-comparison-cuda-opencl-nvidia/14428

  13. GPU programming comparison: OpenCL vs Compute Shader vs CUDA vs Thrust - Reddit, accessed October 6, 2025, https://www.reddit.com/r/gamedev/comments/9pvq12/gpu_programming_comparison_opencl_vs_compute/

  14. OpenCL vs. DirectCompute? - Stack Overflow, accessed October 6, 2025, https://stackoverflow.com/questions/3172220/opencl-vs-directcompute

  15. CUDA vs OpenCL: Which One For GPU Programming? | Incredibuild, accessed October 6, 2025, https://www.incredibuild.com/blog/cuda-vs-opencl-which-to-use-for-gpu-programming

  16. SYCL, CUDA, and others --- experiences and future trends in heterogeneous C++ programming? : r/cpp - Reddit, accessed October 6, 2025, https://www.reddit.com/r/cpp/comments/1im99l2/sycl_cuda_and_others_experiences_and_future/