9.1 Parallel Programming Models
The field of parallel programming encompasses a diverse range of models, languages, and libraries that have evolved over several decades to adapt to the changing landscape of computer architecture.[1] To select the appropriate tool for a given task, one must understand the principles that differentiate these models. The distinctions are driven primarily by two fundamental dichotomies: how the underlying hardware organizes memory, and how computational work is decomposed.[2]
9.1.1 Foundational Dichotomies: Memory and Work Decomposition
The design of a parallel programming model is significantly influenced by the machine’s memory system architecture. This characteristic determines how processing elements communicate and shapes the programmer’s approach to problem-solving.
Shared vs. Distributed Memory Architectures
Parallel computer architectures are broadly classified into two families based on how processors access memory.[3]
| Architecture Type | Memory Organization | Communication Method | Key Advantages | Key Challenges | Examples |
|---|---|---|---|---|---|
| Shared Memory | Single global address space accessible by all processors[4] | Implicit through shared variables[5] | Simpler programming model; implicit communication | Race conditions; synchronization overhead; limited scalability due to memory bus contention[3][6] | Multi-core CPUs, desktops, laptops, servers[7] |
| Distributed Memory | Each processor has private local memory[4] | Explicit message passing[4] | High scalability; adding nodes increases compute power and memory | Complex programming; must manage data partitioning, distribution, and communication[4] | HPC clusters, supercomputers[8] |
| Distributed Shared Memory (DSM) | Illusion of shared address space over distributed hardware[4] | Appears shared but uses message passing underneath | Simpler than pure distributed memory | Hidden but present communication latency[4] | Research systems, some commercial platforms |
| PGAS (Partitioned Global Address Space) | Logically partitioned global address space with processor affinity[6] | Shared-memory-style reads/writes with locality awareness | Balance of programmability and performance; locality control | Requires careful data placement | UPC, Chapel, X10 languages |
The Fundamental Trade-off: Shared-memory programs focus on synchronization (preventing race conditions through locks, semaphores, and atomic operations), while distributed-memory programs emphasize data movement (managing explicit communication between processors). Migrating an application from one model to the other requires a complete re-evaluation of the algorithm’s data structures and communication patterns.[4][5][6]
Data Parallelism vs. Task Parallelism
The second fundamental dichotomy concerns how a problem is decomposed into parallel units of work.[2]
| Parallelism Type | Definition | Source of Parallelism | Best Suited For | Common Examples | Architectural Alignment |
|---|---|---|---|---|---|
| Data Parallelism | Distribute a large dataset across multiple processing units; apply the same operation to each subset[2] | The data, not the operations | Regularly structured data (arrays, matrices)[6] | Vector processing in simulations; image filters applied to pixels; deep learning training on batches[9] | SIMD, SPMD[10] |
| Task Parallelism | Distribute distinct tasks across different processors for concurrent execution[9] | Different operations/functions | Irregular or heterogeneous problems; workflows with dependencies[10] | Pipelining; concurrent atmosphere and ocean models in climate simulation | MIMD[2] |
Real-World Hybrid Approach: Few applications are exclusively data-parallel or task-parallel. Most exhibit a combination of both at various program levels. A climate simulation might employ data parallelism to update grid cells within a time step (same operation, different data) and task parallelism to concurrently execute separate atmosphere and ocean models (different tasks).[10]
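To make the distinction concrete, the sketch below expresses both forms with nothing but standard C++17: a parallel std::transform applies one operation to every element (data parallelism), while two std::async calls run different computations concurrently (task parallelism). The functions blur_pixel, step_atmosphere, and step_ocean are hypothetical stand-ins, and std::execution::par may require linking an extra library (e.g., TBB with GCC).

```cpp
// Data vs. task parallelism with standard C++ only (illustrative sketch).
// blur_pixel, step_atmosphere, and step_ocean are hypothetical stand-ins.
#include <algorithm>
#include <execution>
#include <future>
#include <vector>

float blur_pixel(float p) { return p * 0.5f; }   // the "same operation"
void  step_atmosphere() { /* ... */ }            // distinct task A
void  step_ocean()      { /* ... */ }            // distinct task B

int main() {
    // Data parallelism: one operation applied to every element of a dataset.
    std::vector<float> pixels(1'000'000, 1.0f);
    std::transform(std::execution::par, pixels.begin(), pixels.end(),
                   pixels.begin(), blur_pixel);

    // Task parallelism: two different computations run concurrently.
    auto atmosphere = std::async(std::launch::async, step_atmosphere);
    auto ocean      = std::async(std::launch::async, step_ocean);
    atmosphere.get();
    ocean.get();
    return 0;
}
```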
9.1.2 Models for Shared Memory Systems
Programming models for shared-memory systems utilize the global address space, primarily offering mechanisms to create concurrent threads of execution and manage their access to shared data. The development of these models demonstrates an ongoing effort to increase abstraction, thereby protecting the programmer from the complexities of direct thread management while maintaining high performance.
Directive-Based Parallelism: OpenMP and the Fork-Join Model
Open Multi-Processing (OpenMP) is the primary standard for high-level, directive-based parallel programming on shared-memory systems.[11] OpenMP is a specification, not a library, implemented by compiler vendors (e.g., GCC, Intel C++ Compiler). It enables programmers to incrementally parallelize serial C, C++, and Fortran code using special preprocessor directives, known as pragmas.[12]
The Fork-Join Execution Model: An OpenMP program starts as a single master thread. When encountering a parallel region, it “forks,” creating a team of parallel worker threads. All threads execute the region concurrently. At the end, worker threads “join” back with the master thread, which resumes serial execution until the next parallel region.[12]
Key Features of OpenMP:
- Loop Parallelization: Adding #pragma omp parallel for before a loop automatically distributes iterations among available threads[12]
- Data Scope Control: Clauses manage variable visibility (e.g., private, shared)[11]
- Reduction Operations: Built-in support for common reductions (e.g., reduction(+:sum))[11]
- Scheduling Control: Various policies for iteration distribution among threads[11]
OpenMP’s Key Advantage: Incremental parallelization. Developers can achieve significant performance improvements on multi-core processors with minimal code changes, making it fundamental to HPC.[7][12]
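The following minimal sketch, assuming a compiler with OpenMP support (e.g., g++ -fopenmp sum.cpp), shows the fork-join model, loop parallelization, and a reduction clause working together; it is illustrative rather than tuned code.

```cpp
// Parallel sum with OpenMP: the pragma forks a team of threads, splits the
// loop iterations among them, and combines the per-thread partial sums.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 0.5);
    const long n = static_cast<long>(data.size());
    double sum = 0.0;

    // Fork: worker threads share the iterations; 'sum' is reduced with '+'.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += data[i];
    }
    // Join: only the master thread continues past this point.

    std::printf("sum = %f\n", sum);
    return 0;
}
```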
Explicit Threading: POSIX Threads (Pthreads)
In contrast to OpenMP, POSIX Threads (Pthreads) represents a lower level of abstraction. Pthreads is a low-level C programming API, standardized by the IEEE, offering direct and explicit control over threads.[7] This library-based approach requires including the <pthread.h> header and linking against the Pthreads library.[13]
With Pthreads, the programmer manages all aspects of the thread lifecycle.[13] New threads are created with the pthread_create() function, which specifies the function the new thread will execute.[14] The main program can await a thread’s completion using pthread_join().[13] Since all threads within a process share the same memory space, programmers must manually manage synchronization to prevent data races. Pthreads offers a suite of synchronization primitives, with the mutual exclusion lock, or mutex (pthread_mutex_t), being the most fundamental. A thread must acquire the lock with pthread_mutex_lock() before entering a critical section that accesses shared data and release it with pthread_mutex_unlock() upon exit, ensuring exclusive access to that data.[13]
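A minimal sketch of this lifecycle, assuming a POSIX system and compilation with -pthread: two worker threads increment a shared counter, with a mutex guarding the critical section.

```cpp
// Two worker threads increment a shared counter; a mutex prevents a data race.
#include <pthread.h>
#include <cstdio>

long counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void*) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&counter_lock);    // enter the critical section
        ++counter;
        pthread_mutex_unlock(&counter_lock);  // leave the critical section
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);  // create two workers
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);                      // wait for both to finish
    pthread_join(t2, nullptr);
    std::printf("counter = %ld\n", counter);        // expect 200000
    return 0;
}
```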
This model provides extensive control and flexibility, but at the cost of significant programming complexity. Developers must identify every potential data race, code all synchronization by hand, and partition work among threads manually. Consequently, while Pthreads is powerful for system-level programming, it is often too cumbersome for general application development, where higher-level models are typically preferred.[7]
Task-Based Parallelism: Intel Threading Building Blocks (oneTBB)
Intel’s Threading Building Blocks (now oneAPI Threading Building Blocks, or oneTBB) presents a modern C++ approach that balances the high-level simplicity of OpenMP with the explicit control of Pthreads.[15] TBB is a C++ template library that abstracts raw threads, encouraging programmers to conceptualize work in terms of tasks.[16]
TBB’s philosophy promotes data-parallel programming using high-level parallel algorithms that operate on data collections, similar to the C++ Standard Template Library (STL).[17] For instance, instead of manually creating threads for loop parallelization, a developer can use TBB’s tbb::parallel_for algorithm, providing it with a range of indices and a C++ lambda function or function object that defines the work for each iteration.[16]
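A minimal sketch of that idiom, assuming oneTBB is installed and linked (e.g., -ltbb): the lambda receives a sub-range of indices, and the runtime decides how the range is split and which thread runs each piece.

```cpp
// Square every element of a vector with oneTBB's parallel_for.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 3.0);

    // The lambda receives a sub-range; the work-stealing scheduler decides
    // which thread processes which chunk.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, v.size()),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              v[i] *= v[i];
                      });
    return 0;
}
```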
The efficacy of TBB stems from its sophisticated runtime scheduler.[18] In contrast to the often static scheduling of basic OpenMP, TBB utilizes a dynamic work-stealing scheduler to achieve automatic load balancing.[19] In this model, each thread maintains its own task queue. When a thread becomes idle, it attempts to “steal” a task from the queue of another busy thread.[19] This mechanism effectively manages applications with irregular or unpredictable task execution times, ensuring continuous processor core utilization without requiring manual programmer intervention.[16] The combination of high-level C++ abstractions and an intelligent runtime positions TBB as a robust tool for developing complex, scalable parallel applications on shared-memory systems.
9.1.3 Models for Distributed Memory Systems
When computation extends beyond a single machine to clusters of interconnected nodes, the shared-memory assumption becomes invalid. Programming models for these systems must address the physical reality of distributed memory, prioritizing data locality and explicit communication.
Message Passing: The MPI Standard
The Message Passing Interface (MPI) is the established standard for programming distributed-memory parallel computers, from small clusters to the world’s largest supercomputers.[5] MPI is not a language but a specification for a library of functions that can be invoked from C, C++, and Fortran to manage parallel processes and their communication.[20]
MPI Fundamental Concepts: An MPI program consists of independent processes, often on distinct physical nodes. Processes are organized into a communicator (default: MPI_COMM_WORLD). Each process has a unique integer rank (starting from 0) within its communicator.[20]
Communication in MPI is explicit, relying on messages sent and received between processes identified by their ranks.[20]
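A minimal sketch of these concepts, assuming an MPI implementation such as MPICH or Open MPI (compiled with mpicxx and launched with mpirun -np 2): rank 0 sends an integer to rank 1.

```cpp
// Rank 0 sends an integer to rank 1, which prints it.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank

    if (rank == 0) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, /*source=*/0, /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```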
MPI Communication Patterns
| Type | Operations | Description | Use Case |
|---|---|---|---|
| Point-to-Point | MPI_Send(), MPI_Recv() | Single sender, single receiver. Sender packages data and transmits to destination rank; receiver blocks until message arrives[21] | Direct data exchange between specific processes |
| Broadcast | MPI_Bcast() | One process sends identical data to all other processes[21] | Distributing global parameters or configuration |
| Scatter | MPI_Scatter() | One process distributes distinct portions of an array to all processes[21] | Partitioning data for parallel processing |
| Gather | MPI_Gather() | Inverse of Scatter; collects distinct data from all processes to one[21] | Collecting results after parallel computation |
| Reduction | MPI_Reduce() | Combines data from all processes into single result using specified operation (e.g., sum, max)[21] | Computing global aggregates like totals or maximums |
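As a hedged illustration of the collective operations in the table, the sketch below sums one contribution per process with MPI_Reduce; with four processes it prints 0 + 1 + 2 + 3 = 6 on rank 0.

```cpp
// Each rank contributes its own rank number; rank 0 receives the global sum.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank;   // this process's contribution
    int total = 0;      // meaningful only on the root after the call
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, /*root=*/0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```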
MPI’s Design Philosophy: By requiring explicit management of data placement and inter-process communication, MPI offers maximum control over parallel execution in distributed systems. Though more complex than shared-memory models, its portability and scalability have made it fundamental to high-performance computing.[20]
9.1.4 Models for Heterogeneous and Accelerator-Based Systems
The emergence of specialized hardware accelerators, particularly Graphics Processing Units (GPUs), has added a new dimension to parallel programming. These devices provide significant computational power but have distinct architectures and memory systems, requiring specialized programming models to manage the complex interaction between a host CPU and an accelerator device.
GPU Computing: NVIDIA’s CUDA and the OpenCL Standard
The evolution of GPUs from fixed-function graphics pipelines to fully programmable parallel processors for general-purpose computing (GPGPU) marked a significant turning point in HPC.[7]
CUDA (Compute Unified Device Architecture)
Developed by NVIDIA, CUDA is a proprietary parallel computing platform and programming model for its GPUs.[22] It extends C++ with a small set of language constructs and a runtime library.
CUDA’s Heterogeneous Model: Explicitly based on interaction between a host (CPU) and devices (GPUs). Host and device have separate memory spaces (system RAM and GPU VRAM), requiring explicit data transfers via cudaMemcpy(). Parallel computations are written as kernels (designated by __global__), launched from the host with syntax specifying the parallel thread count.[23][24]
| Component | Role | Key Operations |
|---|---|---|
| Host (CPU) | Orchestrates computation; manages data transfers | Allocate GPU memory, transfer data, launch kernels |
| Device (GPU) | Executes massively parallel computations | Run thousands of threads executing kernel code |
| Kernel | GPU function executed by many threads in parallel | Designated by __global__; launched with <<<...>>> syntax[24] |
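A minimal CUDA C++ sketch of this host/device division, assuming an NVIDIA GPU and the nvcc compiler; error checking is omitted for brevity, and the 256-thread block size is an arbitrary illustrative choice.

```cpp
// Vector scaling in CUDA C++: the host allocates device memory, copies data
// in, launches the kernel across many threads, and copies the result back.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* device = nullptr;
    cudaMalloc((void**)&device, n * sizeof(float));   // allocate GPU memory
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(device, 2.0f, n); // launch the kernel
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("host[0] = %f\n", host[0]);           // expect 2.0
    cudaFree(device);
    delete[] host;
    return 0;
}
```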
OpenCL (Open Computing Language)
In response to demand for a vendor-neutral standard, the Khronos Group developed OpenCL.[25]
OpenCL’s Portability Promise: Open, royalty-free standard for writing programs that execute on various platforms (CPUs, GPUs, DSPs, FPGAs). Theoretically, an OpenCL program can run on any compliant hardware from any vendor.[25][26]
| Feature | CUDA | OpenCL |
|---|---|---|
| Vendor | NVIDIA (proprietary) | Khronos Group (open standard) |
| Hardware Support | NVIDIA GPUs only | CPUs, GPUs, DSPs, FPGAs from multiple vendors[25] |
| Programming Model | Host-device with explicit memory management | Host-device with context, command queues, kernels[26][27] |
| Kernel Language | CUDA C/C++ | OpenCL C (C99-based dialect)[26] |
| Key Advantage | Highly optimized for NVIDIA hardware; mature ecosystem[7] | True cross-platform portability[25] |
| Trade-off | Vendor lock-in | May sacrifice performance for genericity[28] |
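For comparison with the CUDA sketch above, the following host program performs the same vector scaling through OpenCL; it is a minimal illustration assuming an OpenCL 2.x SDK and ICD loader (link with -lOpenCL), with error checking omitted. Note how the kernel is supplied as a source string and compiled at run time through the context and command-queue machinery listed in the table.

```cpp
// The same vector scaling expressed in OpenCL: the kernel is a source string
// compiled at run time; the host sets up a context, command queue, and buffer.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSource = R"(
__kernel void scale(__global float* data, float factor) {
    size_t i = get_global_id(0);
    data[i] *= factor;
})";

int main() {
    const size_t n = 1024;
    std::vector<float> host(n, 1.0f);
    float factor = 2.0f;
    cl_int err = 0;

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "scale", &err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), host.data(), &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host.data(),
                        0, nullptr, nullptr);

    std::printf("host[0] = %f\n", host[0]);   // expect 2.0

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```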
Modern C++ Abstractions: SYCL
The complexity of managing separate host and device codebases in CUDA and OpenCL prompted the development of higher-level abstractions. SYCL (pronounced “sickle”) is a royalty-free, cross-platform abstraction layer from the Khronos Group that allows code for heterogeneous processors to be written in a single-source style using standard C++.[29]
SYCL’s Single-Source Innovation: Both host and device code in the same source file, using modern C++ features (classes, templates, lambda functions). The SYCL runtime automatically manages kernel execution and data transfers through buffer and accessor objects, abstracting hardware complexity from the developer.[29][30]
SYCL Architecture:
| Component | Function |
|---|---|
| Queue Object | Submits work to compute devices[30] |
| Command Group | Encapsulates kernel and data dependencies (defined in lambda)[30] |
| Buffer & Accessor Objects | Describe and manage data transfers automatically[29] |
| Backend | Operates on top of OpenCL, CUDA, or other compute APIs[29] |
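A minimal single-source sketch, assuming a SYCL 2020 implementation such as DPC++ or AdaptiveCpp: the queue, command group, buffer, and accessor from the table above appear directly in the code, and the runtime performs device selection and data transfers.

```cpp
// Double every element of a vector on whatever device the SYCL runtime selects.
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> data(1024, 1.0f);

    sycl::queue q;   // work is submitted to a compute device through a queue
    {
        sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(data.size()));

        // Command group: declares the data it needs (accessor) and the kernel.
        q.submit([&](sycl::handler& h) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(h);
            h.parallel_for(sycl::range<1>(data.size()),
                           [=](sycl::id<1> i) { acc[i] *= 2.0f; });
        });
    }   // the buffer's destruction copies the results back into 'data'

    std::printf("data[0] = %f\n", data[0]);   // expect 2.0
    return 0;
}
```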
This single-source approach enhances programmability and code maintainability for complex heterogeneous systems, representing a significant step in abstracting hardware complexity from the developer.[29]
9.1.5 Comparative Analysis of Programming Models
The choice of a parallel programming model requires balancing performance, programmability, and portability against the target hardware. The evolution from low-level models such as Pthreads, which provide maximum control at the cost of high complexity, to high-level abstractions like SYCL, which prioritize programmer productivity and portability, directly addresses the increasing complexity of parallel hardware.[7] This progression is a necessary adaptation to make parallel programming more accessible. The optimal model is context-dependent, determined by the application’s specific requirements, the available hardware, and development resources. The following table summarizes the key models discussed, outlining their defining characteristics and typical use cases.
| Model | Primary Memory Model | Parallelism Type | Abstraction Level | Primary Use Case | Programmability |
|---|---|---|---|---|---|
| Pthreads | Shared | Task | Low | Fine-grained thread control, systems programming | Low |
| OpenMP | Shared | Data / Task | High | Incremental parallelization of loops on multi-core CPUs | High |
| oneTBB | Shared | Data / Task | High | C++ task-based parallelism with dynamic load balancing | High |
| MPI | Distributed | Data / Task | Low | Large-scale cluster and supercomputer programming | Low |
| CUDA | Host-Device | Data | Medium | High-performance computing on NVIDIA GPUs | Medium |
| OpenCL | Host-Device | Data / Task | Medium | Portable parallel programming for heterogeneous accelerators | Medium |
| SYCL | Host-Device | Data / Task | High | Single-source C++ for portable heterogeneous programming | High |
References
1. Models for Parallel Computing: Review and Perspectives. Semantic Scholar. Accessed October 7, 2025. https://www.semanticscholar.org/paper/Models-for-Parallel-Computing-%3A-Review-and-Kessler-Keller/c924481fbb05bb807920c8f3f2f4d9234c9f1c29
2. Insights on Parallel Programming Model. Advanced Millennium Technologies. Accessed October 7, 2025. https://blog.amt.in/index.php/2023/01/17/insights-on-parallel-programming-model/
3. Shared Versus Distributed Memory Multiprocessors. ECMWF. Accessed October 7, 2025. https://www.ecmwf.int/sites/default/files/elibrary/1990/10302-shared-versus-distributed-memory-multiprocessors.pdf
4. Distributed memory. Wikipedia. Accessed October 7, 2025. https://en.wikipedia.org/wiki/Distributed_memory
5. How Parallel Processing Shaped Modern Computing. CelerData. Accessed October 7, 2025. https://celerdata.com/glossary/how-parallel-processing-shaped-modern-computing
6. Parallel programming model. Wikipedia. Accessed October 7, 2025. https://en.wikipedia.org/wiki/Parallel_programming_model
7. The Evolution of Parallel Programming. Tiwariabhinav, Medium. Accessed October 7, 2025. https://medium.com/@tiwariabhinav424/the-evolution-of-parallel-programming-d80665066b88
8. Parallel Computing at a Glance. Accessed October 7, 2025. http://www.buyya.com/microkernel/chap1.pdf
9. Data Parallel, Task Parallel, and Agent Actor Architectures. bytewax. Accessed October 7, 2025. https://bytewax.io/blog/data-parallel-task-parallel-and-agent-actor-architectures
10. Task parallelism. Wikipedia. Accessed October 7, 2025. https://en.wikipedia.org/wiki/Task_parallelism
11. Using OpenMP with C. Research Computing, University of …. Accessed October 7, 2025. https://curc.readthedocs.io/en/latest/programming/OpenMP-C.html
12. An introduction to OpenMP. University College London. Accessed October 7, 2025. https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html
13. Multithreaded Programming (POSIX pthreads Tutorial). randu.org. Accessed October 7, 2025. https://randu.org/tutorials/threads/
14. C++ Tutorial: Multi-Threaded Programming - C++ Class Thread for Pthreads. BogoToBogo, 2020. Accessed October 7, 2025. https://www.bogotobogo.com/cplusplus/multithreading_pthread.php
15. Getting Started with Intel® Threading Building Blocks (Intel® TBB). Accessed October 7, 2025. https://www.intel.com/content/www/us/en/developer/articles/guide/get-started-with-tbb.html
16. Introduction to the Intel Threading Building Blocks. mcs572 0.7.8 documentation. Accessed October 7, 2025. http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/lec11.html
17. TBB Tutorial. cs.wisc.edu. Accessed October 7, 2025. https://pages.cs.wisc.edu/~gibson/tbbTutorial.html
18. Introducing Intel TBB. Agenda INFN. Accessed October 7, 2025. https://agenda.infn.it/event/4107/contributions/50346/attachments/35473/41878/Tbb_Introduction.pdf
19. Threading Building Blocks. Wikipedia. Accessed October 7, 2025. https://en.wikipedia.org/wiki/Threading_Building_Blocks
20. MPI Tutorial – Part 1. Accessed October 7, 2025. https://spcl.inf.ethz.ch/Teaching/2017-dphpc/recitation/mpi1.pdf
21. Tutorials · MPI Tutorial. Accessed October 7, 2025. https://mpitutorial.com/tutorials/
22. Review of Architecture and Model for Parallel Programming. ResearchGate. Accessed October 7, 2025. https://www.researchgate.net/publication/372822306_Review_of_Architecture_and_Model_for_Parallel_Programming
23. CUDA C++ Programming Guide. NVIDIA Docs. Accessed October 7, 2025. https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
24. CUDA Programming Guide: An Overview. Zia Babar, Medium. Accessed October 7, 2025. https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8
25. OpenCL - The Open Standard for Parallel Programming of …. Khronos Group. Accessed October 7, 2025. https://www.khronos.org/opencl/
26. OpenCL. Wikipedia. Accessed October 7, 2025. https://en.wikipedia.org/wiki/OpenCL
27. OpenCL execution model. Arm Developer. Accessed October 7, 2025. https://developer.arm.com/documentation/dui0538/e/opencl-concepts/opencl-execution-model
28. An introduction to OpenCL. Purdue Engineering. Accessed October 7, 2025. https://engineering.purdue.edu/~smidkiff/ece563/NVidiaGPUTeachingToolkit/Mod20OpenCL/3rd-Edition-AppendixA-intro-to-OpenCL.pdf
29. Heterogeneous programming with SYCL documentation. Accessed October 7, 2025. https://enccs.github.io/sycl-workshop/
30. Getting Started. SYCL.tech. Accessed October 7, 2025. https://sycl.tech/getting-started/