9.1 Survey of Parallel Programming Models and Libraries

The landscape of parallel programming is rich and varied, populated by a multitude of models, languages, and libraries that have evolved over decades to address the ever-changing face of computer architecture.1 Selecting the appropriate tool for a given task requires a clear understanding of this landscape, which is best navigated by first grasping the fundamental principles that differentiate these models. At the highest level, these distinctions are driven by two foundational dichotomies: how the underlying hardware organizes memory and how the computational work is decomposed.2

9.1.1 Foundational Dichotomies: Memory and Work Decomposition

The most profound influence on the design of a parallel programming model is the architecture of the machine’s memory system. This single characteristic dictates how processing elements communicate and, consequently, imposes a cognitive model on the programmer that is difficult, if not impossible, to escape.

Shared vs. Distributed Memory Architectures

Parallel computer architectures are broadly classified into two families based on how processors access memory.3

  • Shared Memory: In a shared-memory architecture, all processors or cores are connected to a single, global address space.4 Any processor can read from or write to any memory location. This model is conceptually simple for the programmer; communication occurs implicitly by reading and writing to shared variables.5 Modern multi-core CPUs found in desktops, laptops, and servers are quintessential examples of shared-memory systems.6 However, this simplicity belies significant challenges. The primary difficulty is managing concurrent access to shared data to prevent race conditions, where the final outcome of a computation depends on the unpredictable timing of operations by different threads. To ensure correctness, programmers must use synchronization mechanisms like locks, semaphores, or atomic operations to protect critical sections of code.7 Furthermore, as the number of processors increases, the single shared memory bus can become a performance bottleneck, limiting the scalability of the system.3
  • Distributed Memory: In a distributed-memory architecture, each processor has its own private, local memory that other processors cannot directly access.4 These systems consist of multiple independent nodes (each with its own processor and memory) connected by a network.5 Communication is achieved explicitly through message passing, where one processor sends a message containing data to another processor, which must explicitly receive it.4 This model is more complex to program, as the developer is responsible for all data partitioning, distribution, and communication.4 However, it is highly scalable; by adding more nodes, one can increase both computational power and aggregate memory capacity. This is the dominant architecture for large-scale systems, including high-performance computing (HPC) clusters and supercomputers.8
  • Hybrid Models: To bridge the gap between these two extremes, hybrid models have been developed. Distributed Shared Memory (DSM) systems provide the illusion of a single shared address space on top of physically distributed memory, hiding the underlying message passing from the programmer.4 While this simplifies programming, it does not eliminate the performance penalty of remote memory access; the latency of communication is hidden, but not removed.4 Another approach is the Partitioned Global Address Space (PGAS) model, which exposes a global address space but logically partitions it, with each portion having an “affinity” to a specific processor. This allows the programmer to reason about data locality while still using convenient shared-memory-style reads and writes.7

The choice between a shared or distributed memory model is not merely an implementation detail; it fundamentally shapes the entire programming paradigm. A shared-memory program is primarily concerned with synchronization, while a distributed-memory program is primarily concerned with data movement. Porting an application from one to the other requires a complete rethinking of the algorithm’s data structures and communication patterns.
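
To make the synchronization concern concrete, the following minimal sketch uses plain standard C++ threads (none of the libraries discussed later in this chapter); the shared counter and the iteration count are purely illustrative. Two threads increment the same variable: without the mutex the increments would race and the final value would be unpredictable, while with it each increment executes inside a protected critical section.

```cpp
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long counter = 0;   // shared state: both threads read and write it
    std::mutex m;       // guards the critical section below

    auto work = [&]() {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // without this, ++counter is a data race
            ++counter;
        }
    };

    std::thread t1(work), t2(work);  // two concurrent writers to the same location
    t1.join();
    t2.join();

    // 2000000 with the lock; an unpredictable smaller value if the lock is removed.
    std::cout << "counter = " << counter << '\n';
    return 0;
}
```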

The second fundamental dichotomy concerns how a problem is decomposed into parallel units of work.2

  • Data Parallelism: This model focuses on distributing a large dataset across multiple processing units and applying the same operation to each subset of the data concurrently.2 The parallelism stems from the data, not the operations. This is an extremely common and highly scalable pattern, particularly well-suited for regularly structured data like arrays and matrices.7 Classic examples include vector processing in scientific simulations, image processing where the same filter is applied to every pixel, and training deep learning models on large batches of data.9 Architecturally, this maps well to Single Instruction, Multiple Data (SIMD) or Single Program, Multiple Data (SPMD) execution models.10
  • Task Parallelism: This model, also known as function or control parallelism, focuses on distributing different tasks (functions, code blocks) across different processors for concurrent execution.9 The tasks may operate on the same data or on different data, and they can be entirely independent or part of a complex workflow with dependencies.10 A common example is pipelining, where a stream of data passes through a series of stages, with each stage performing a different task.10 Task parallelism is more flexible for irregular or heterogeneous problems but can present significant challenges in load balancing, as tasks may have widely varying execution times.9 This model corresponds to the Multiple Instruction, Multiple Data (MIMD) architectural classification.2

In practice, few real-world applications are purely data-parallel or purely task-parallel. Most exist on a continuum, often exhibiting both forms of parallelism at different levels of the program’s structure.10 For instance, a climate simulation might use data parallelism to update the state of grid cells within a time step (same operation, different data) while using task parallelism to concurrently run separate models for the atmosphere and the ocean (different tasks).
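
To make the distinction concrete, the sketch below uses only standard C++ (std::thread and std::async); the array, its size, and the two stand-in “model” functions are illustrative assumptions, not drawn from any particular application. The first step applies the same operation to different halves of one array (data parallelism); the second runs two unrelated functions concurrently (task parallelism).

```cpp
#include <future>
#include <thread>
#include <vector>

// Data parallelism: the same operation applied to different portions of one array.
void scale_range(std::vector<double>& v, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) v[i] *= 2.0;
}

// Task parallelism: two different computations that can proceed independently.
double simulate_atmosphere() { return 1.0; }  // stand-in for one model component
double simulate_ocean()      { return 2.0; }  // stand-in for another

int main() {
    std::vector<double> grid(1000000, 1.0);

    // Data-parallel step: same code, different data.
    std::thread lower(scale_range, std::ref(grid), std::size_t{0}, grid.size() / 2);
    std::thread upper(scale_range, std::ref(grid), grid.size() / 2, grid.size());
    lower.join();
    upper.join();

    // Task-parallel step: different code, run concurrently.
    auto atmosphere = std::async(std::launch::async, simulate_atmosphere);
    auto ocean      = std::async(std::launch::async, simulate_ocean);
    double combined = atmosphere.get() + ocean.get();

    return combined > 0.0 ? 0 : 1;  // keep the results observable
}
```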

9.1.2 Models for Shared-Memory Systems

Programming models for shared-memory systems leverage the convenience of a global address space, focusing primarily on providing mechanisms to create concurrent threads of execution and manage their access to shared data. The evolution of these models reflects a continuous effort to raise the level of abstraction, shielding the programmer from the complexities of direct thread management while still enabling high performance.

Directive-Based Parallelism: OpenMP and the Fork-Join Model

Open Multi-Processing (OpenMP) is the dominant standard for high-level, directive-based parallel programming on shared-memory systems.11 Rather than being a library, OpenMP is a specification implemented by compiler vendors (such as GCC and the Intel compilers) that allows a programmer to incrementally parallelize serial C, C++, and Fortran code using special preprocessor directives, or pragmas.12

The core of OpenMP is its fork-join execution model.12 An OpenMP program begins execution with a single thread, known as the master thread. When the master thread encounters a parallel region (demarcated by an OpenMP directive), it “forks,” creating a team of parallel worker threads. The code within this region is then executed in parallel by all threads in the team. At the end of the parallel region, the worker threads “join” back with the master thread, which then continues with serial execution until the next parallel region is encountered.12

The most common use of OpenMP is loop parallelization. By adding a single line of code, #pragma omp parallel for, before a for loop, the programmer instructs the compiler to automatically divide the loop’s iterations among the available threads.12 This simplicity is OpenMP’s greatest strength; it allows developers to achieve significant performance gains on multi-core processors with minimal code modification, making it a cornerstone of HPC.6 The model also provides a rich set of clauses to manage data scope (private, shared), perform reductions (reduction(+:sum)), and control scheduling.11
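
As a concrete illustration of these directives and clauses, the fragment below parallelizes a dot product with #pragma omp parallel for and a reduction(+:sum) clause. It is a minimal sketch, not taken from any cited tutorial; the vector sizes are arbitrary, and it assumes an OpenMP-capable compiler (e.g. g++ -fopenmp).

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // Fork: the master thread creates a team and the loop iterations are divided among them.
    // Each thread accumulates a private partial sum; the reduction clause combines them at the join.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }

    // Join: serial execution resumes on the master thread.
    std::printf("dot product = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```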

Explicit Threading: POSIX Threads (Pthreads)

At the opposite end of the abstraction spectrum from OpenMP lies POSIX Threads, or Pthreads. Pthreads is not a high-level model but a low-level C programming API, standardized by the IEEE, that provides direct and explicit control over threads.6 It is a library-based approach, requiring the inclusion of the <pthread.h> header and linking against the Pthreads library.13

With Pthreads, the programmer is responsible for every aspect of the thread lifecycle.13 A new thread is created with the pthread_create() function, which specifies the function the new thread will execute.14 The main program can wait for a thread to complete its execution using pthread_join().13 Because all threads within a process share the same memory space, the programmer must manually manage synchronization to prevent data races. Pthreads provides a suite of synchronization primitives for this purpose, the most fundamental of which is the mutual exclusion lock, or mutex (pthread_mutex_t). A thread must acquire a lock using pthread_mutex_lock() before entering a critical section that accesses shared data and release it with pthread_mutex_unlock() upon exit, ensuring that only one thread can be in the critical section at a time.13

This model offers the ultimate in control and flexibility but comes at the cost of significant programming complexity. The developer must reason about every potential data race, manage all synchronization, and manually partition the work among threads. This makes Pthreads powerful for system-level programming but often too cumbersome for general application development, where higher-level models are preferred.6
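
The sketch below walks through that lifecycle: two threads are started with pthread_create(), each increments a shared counter inside a mutex-protected critical section, and the main thread waits with pthread_join(). The counter and iteration count are illustrative; the program assumes a POSIX system and linking with -pthread.

```cpp
#include <cstdio>
#include <pthread.h>

long counter = 0;                                  // shared data
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // protects counter

void* worker(void* /*unused*/) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);    // enter the critical section
        ++counter;                    // only one thread at a time executes this
        pthread_mutex_unlock(&lock);  // leave the critical section
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);  // start two worker threads
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);                      // wait for both to finish
    pthread_join(t2, nullptr);
    std::printf("counter = %ld\n", counter);        // 200000: the mutex prevents lost updates
    return 0;
}
```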

Task-Based Parallelism: Intel Threading Building Blocks (oneTBB)

Intel’s Threading Building Blocks (now oneAPI Threading Building Blocks, or oneTBB) offers a modern C++ approach that strikes a balance between the high-level simplicity of OpenMP and the explicit control of Pthreads.15 TBB is a C++ template library that abstracts away raw threads and instead encourages the programmer to think in terms of tasks.16 TBB’s philosophy is to enable data-parallel programming through high-level parallel algorithms that operate on data collections, much like the C++ Standard Template Library (STL).17 For example, instead of manually creating threads to parallelize a loop, a developer can use TBB’s tbb::parallel_for algorithm, passing it a range of indices and a C++ lambda function or function object that defines the work for each iteration.16

The true power of TBB lies in its sophisticated runtime scheduler.18 Unlike the often static scheduling of basic OpenMP, TBB employs a dynamic work-stealing scheduler for automatic load balancing.19 In this model, each thread maintains its own queue of tasks. When a thread runs out of work, it becomes a “thief” and attempts to “steal” a task from the queue of another, busy “victim” thread.19 This mechanism is highly effective for applications with irregular or unpredictable task execution times, as it keeps all processor cores busy without requiring manual intervention from the programmer.16 This combination of high-level C++ abstractions and an intelligent runtime makes TBB a powerful tool for developing complex, scalable parallel applications on shared-memory systems.
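
A minimal sketch of this style is shown below: tbb::parallel_for is handed a blocked_range and a lambda, and the runtime decides how to split and schedule the chunks. The data and the per-element operation are illustrative; the program assumes oneTBB is installed and linked (e.g. -ltbb).

```cpp
#include <cstdio>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

int main() {
    std::vector<float> data(1000000, 1.0f);

    // The scheduler recursively splits the range into chunks and runs them as tasks;
    // idle threads steal chunks from busy ones, balancing the load automatically.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;  // the per-iteration work
        });

    std::printf("data[0] = %f\n", data[0]);
    return 0;
}
```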

9.1.3 Models for Distributed Memory Systems

When computation scales beyond a single machine to clusters of interconnected nodes, the shared-memory assumption breaks down. Programming models for these systems must confront the physical reality of distributed memory, making data locality and explicit communication first-class concerns.

The Message Passing Interface (MPI) is the undisputed, de facto standard for programming distributed-memory parallel computers, from small clusters to the world’s largest supercomputers.5 MPI is not a language but a specification for a library of functions that can be called from C, C++, and Fortran to manage parallel processes and their communication.20 The fundamental concepts of MPI are straightforward.20 An MPI program is launched as a collection of independent processes, often running on different physical nodes. These processes are organized into a communication group, or communicator, the default being MPI_COMM_WORLD, which includes all processes. Within a communicator, each process is assigned a unique integer identifier called its rank, starting from 0.20 Communication is explicit and based on sending and receiving messages between processes, identified by their ranks.20 MPI defines a rich set of communication operations, which fall into two main categories:

  1. Point-to-Point Communication: These operations involve a single sender and a single receiver. The most basic are blocking sends and receives, MPI_Send() and MPI_Recv(). When a process calls MPI_Send(), it packages data into a message and sends it to a destination rank. The MPI_Recv() call waits until a message from a source rank arrives and is placed into a specified buffer.21
  2. Collective Communication: These operations involve all processes within a communicator and are used to implement common parallel patterns efficiently. Examples include MPI_Bcast() (broadcast), where one process sends the same data to all other processes; MPI_Scatter(), where one process distributes different chunks of an array to all other processes; MPI_Gather(), the inverse of scatter; and MPI_Reduce(), which combines data from all processes into a single result (e.g., finding a global sum) using a specified operation.21

By forcing the programmer to manage all data placement and inter-process communication explicitly, MPI provides maximum control over the parallel execution on a distributed system. While this makes it more complex than shared-memory models, its portability and scalability have made it an enduring and essential tool in the field of high-performance computing.20
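
To tie these concepts together, the sketch below has every rank contribute a value that MPI_Reduce() sums onto rank 0. It is a minimal illustration rather than a template from the cited tutorials; it assumes an MPI implementation is available and that the program is built and launched with the usual wrappers (e.g. mpicxx and mpirun).

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                    // start the MPI runtime

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's id within the communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size);      // total number of processes

    // Collective communication: every rank contributes one value,
    // and MPI_SUM combines them into a single result on rank 0.
    int partial = rank + 1;
    int total = 0;
    MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("sum of 1..%d = %d\n", size, total);
    }

    MPI_Finalize();                            // shut down the MPI runtime
    return 0;
}
```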

9.1.4 Models for Heterogeneous and Accelerator-Based Systems

The rise of specialized hardware accelerators, most notably Graphics Processing Units (GPUs), has introduced a new dimension to parallel programming. These devices offer immense computational power but feature distinct architectures and memory systems, necessitating specialized programming models that can manage the complex interplay between a traditional host CPU and a powerful accelerator device.

GPU Computing: NVIDIA’s CUDA and the OpenCL Standard

The transformation of GPUs from fixed-function graphics pipelines to fully programmable parallel processors for general-purpose computing (GPGPU) was a watershed moment in HPC.6

  • CUDA (Compute Unified Device Architecture): Developed by NVIDIA, CUDA is a proprietary parallel computing platform and programming model for its GPUs.22 It extends C++ with a set of language extensions and a runtime library. The CUDA programming model is explicitly heterogeneous, based on the interaction between a host (the CPU) and one or more devices (the GPUs).23 The host and device have separate memory spaces (system RAM and GPU VRAM, respectively), and the programmer is responsible for explicitly managing data transfers between them using functions like cudaMemcpy().24 The parallel computations to be run on the GPU are written as special functions called kernels, which are marked with the __global__ specifier. The host launches a kernel onto the device using a special syntax (kernel_name<<<…>>>), specifying the number of parallel threads to execute.24 This model gives the programmer fine-grained control over the massive parallelism of the GPU, making CUDA a dominant force in fields like deep learning and scientific simulation.6 A minimal host-device round trip in this style is sketched after this list.
  • OpenCL (Open Computing Language): In response to the need for a vendor-neutral standard for heterogeneous computing, the Khronos Group developed OpenCL.25 OpenCL is an open, royalty-free standard for writing programs that execute across diverse platforms, including CPUs, GPUs, Digital Signal Processors (DSPs), and Field-Programmable Gate Arrays (FPGAs).25 Like CUDA, OpenCL employs a host-device model with explicit memory management and kernel execution.26 A host program defines a context, manages command queues, and submits kernels for execution on compute devices.27 The kernels themselves are typically written in a C99-based language dialect called OpenCL C.26 The primary advantage of OpenCL is its portability; in theory, an OpenCL program can run on any compliant hardware from any vendor.25 However, this portability often comes at the cost of performance, as code must be more generic and cannot always take advantage of vendor-specific hardware features as effectively as a proprietary model like CUDA.28
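
The sketch below shows the host-device round trip referenced in the CUDA bullet above: allocate device memory, copy input with cudaMemcpy(), launch a __global__ kernel with the <<<…>>> syntax, and copy the result back. The kernel, array size, and launch configuration are illustrative choices, not taken from the cited guides.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each GPU thread scales one element of the array.
__global__ void scale(float* x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_x = new float[n];                  // host buffer (system RAM)
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float* d_x = nullptr;                       // device buffer (GPU memory)
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host -> device

    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, n, 2.0f);      // launch the kernel on the device

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // device -> host
    std::printf("h_x[0] = %f\n", h_x[0]);                  // expect 2.0

    cudaFree(d_x);
    delete[] h_x;
    return 0;
}
```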

The complexity of managing separate host and device codebases in CUDA and OpenCL led to the development of higher-level abstractions. SYCL (pronounced “sickle”) is a royalty-free, cross-platform abstraction layer from the Khronos Group that enables code for heterogeneous processors to be written in a single-source style using standard C++.29 SYCL is not a standalone language but a C++ template library built on top of a backend like OpenCL, CUDA, or others.29 Its key innovation is allowing both host and device code to be written in the same source file using modern C++ features like classes, templates, and lambda functions.29 The programmer submits work to a device via a queue object. The work itself, including the kernel and its data dependencies, is encapsulated within a command group defined inside a lambda function.30 The SYCL runtime automatically manages the execution of the kernel on the target device and handles the necessary data transfers, which are described using buffer and accessor objects.29 This single-source approach significantly improves programmability and code maintainability for complex heterogeneous systems, representing the latest step in the ongoing evolution to abstract away hardware complexity from the developer.29
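
The single-source style can be sketched as follows. This is written against the SYCL 2020 interface and should be read as illustrative rather than tied to any particular implementation (DPC++, AdaptiveCpp, and others provide it); the data and the kernel are arbitrary. A queue selects a device, a buffer wraps the host data, and the command group submitted to the queue contains both the accessor (the data dependency) and the parallel_for kernel.

```cpp
#include <cstdio>
#include <vector>
#include <sycl/sycl.hpp>

int main() {
    std::vector<float> data(1024, 1.0f);

    sycl::queue q;  // default device selection: a GPU if available, otherwise another device

    {
        // The buffer takes ownership of the host data for the duration of its scope.
        sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(data.size()));

        // A command group: the kernel plus its data requirement, expressed as an accessor.
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
                acc[i] *= 2.0f;  // device code in the same source file as the host code
            });
        });
    }  // buffer destruction waits for the kernel and copies the data back to the host

    std::printf("data[0] = %f\n", data[0]);  // expect 2.0
    return 0;
}
```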

9.1.5 Comparative Analysis of Programming Models

The choice of a parallel programming model involves a complex series of trade-offs between performance, programmability, portability, and the nature of the target hardware. The evolution from low-level models like Pthreads, which offer maximum control at the cost of extreme complexity, to high-level abstractions like SYCL, which prioritize programmer productivity and portability, is a direct response to the escalating complexity of parallel hardware.6 This progression is not merely a matter of convenience; it is a necessary adaptation to make parallel programming accessible beyond a small niche of HPC experts. The “best” model is therefore highly context-dependent, defined by the specific requirements of the application, the available hardware, and the development resources. The following table provides a comparative summary of the key models discussed, highlighting their defining characteristics and typical use cases.

Model | Primary Memory Model | Parallelism Type | Abstraction Level | Primary Use Case | Programmability
----- | -------------------- | ---------------- | ----------------- | ----------------- | ---------------
Pthreads | Shared | Task | Low | Fine-grained thread control, systems programming | Low
OpenMP | Shared | Data / Task | High | Incremental parallelization of loops on multi-core CPUs | High
oneTBB | Shared | Data / Task | High | C++ task-based parallelism with dynamic load balancing | High
MPI | Distributed | Data / Task | Low | Large-scale cluster and supercomputer programming | Low
CUDA | Host-Device | Data | Medium | High-performance computing on NVIDIA GPUs | Medium
OpenCL | Host-Device | Data / Task | Medium | Portable parallel programming for heterogeneous accelerators | Medium
SYCL | Host-Device | Data / Task | High | Single-source C++ for portable heterogeneous programming | High

  1. Models for Parallel Computing: Review and Perspectives - Semantic Scholar, accessed October 7, 2025, https://www.semanticscholar.org/paper/Models-for-Parallel-Computing-%3A-Review-and-Kessler-Keller/c924481fbb05bb807920c8f3f2f4d9234c9f1c29

  2. Insights on Parallel Programming Model - Advanced Millennium Technologies, accessed October 7, 2025, https://blog.amt.in/index.php/2023/01/17/insights-on-parallel-programming-model/

  3. Shared Versus Distributed Memory Multiprocessors - ECMWF, accessed October 7, 2025, https://www.ecmwf.int/sites/default/files/elibrary/1990/10302-shared-versus-distributed-memory-multiprocessors.pdf

  4. Distributed memory - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Distributed_memory

  5. How Parallel Processing Shaped Modern Computing - CelerData, accessed October 7, 2025, https://celerdata.com/glossary/how-parallel-processing-shaped-modern-computing

  6. The Evolution of Parallel Programming | by Tiwariabhinav | Medium, accessed October 7, 2025, https://medium.com/@tiwariabhinav424/the-evolution-of-parallel-programming-d80665066b88

  7. Parallel programming model - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Parallel_programming_model

  8. Parallel Computing at a Glance, accessed October 7, 2025, http://www.buyya.com/microkernel/chap1.pdf

  9. Data Parallel, Task Parallel, and Agent Actor Architectures - bytewax, accessed October 7, 2025, https://bytewax.io/blog/data-parallel-task-parallel-and-agent-actor-architectures

  10. Task parallelism - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Task_parallelism

  11. Using OpenMP with C — Research Computing University of …, accessed October 7, 2025, https://curc.readthedocs.io/en/latest/programming/OpenMP-C.html

  12. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html

  13. Multithreaded Programming (POSIX pthreads Tutorial) - randu.org, accessed October 7, 2025, https://randu.org/tutorials/threads/

  14. C++ Tutorial: Multi-Threaded Programming - C++ Class Thread for Pthreads - 2020 - BogoToBogo, accessed October 7, 2025, https://www.bogotobogo.com/cplusplus/multithreading_pthread.php

  15. Getting Started with Intel® Threading Building Blocks (Intel® TBB), accessed October 7, 2025, https://www.intel.com/content/www/us/en/developer/articles/guide/get-started-with-tbb.html

  16. Introduction to the Intel Threading Building Blocks — mcs572 0.7.8 documentation, accessed October 7, 2025, http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/lec11.html

  17. TBB Tutorial - cs.wisc.edu, accessed October 7, 2025, https://pages.cs.wisc.edu/~gibson/tbbTutorial.html

  18. Introducing Intel Tbb - Agenda INFN, accessed October 7, 2025, https://agenda.infn.it/event/4107/contributions/50346/attachments/35473/41878/Tbb_Introduction.pdf

  19. Threading Building Blocks - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Threading_Building_Blocks

  20. MPI Tutorial – Part 1, accessed October 7, 2025, https://spcl.inf.ethz.ch/Teaching/2017-dphpc/recitation/mpi1.pdf

  21. Tutorials · MPI Tutorial, accessed October 7, 2025, https://mpitutorial.com/tutorials/

  22. (PDF) Review of Architecture and Model for Parallel Programming - ResearchGate, accessed October 7, 2025, https://www.researchgate.net/publication/372822306_Review_of_Architecture_and_Model_for_Parallel_Programming

  23. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

  24. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8

  25. OpenCL - The Open Standard for Parallel Programming of …, accessed October 7, 2025, https://www.khronos.org/opencl/

  26. OpenCL - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/OpenCL

  27. OpenCL execution model - Arm Developer, accessed October 7, 2025, https://developer.arm.com/documentation/dui0538/e/opencl-concepts/opencl-execution-model

  28. An introduction to OpenCL - Purdue Engineering, accessed October 7, 2025, https://engineering.purdue.edu/~smidkiff/ece563/NVidiaGPUTeachingToolkit/Mod20OpenCL/3rd-Edition-AppendixA-intro-to-OpenCL.pdf

  29. Heterogeneous programming with SYCL documentation, accessed October 7, 2025, https://enccs.github.io/sycl-workshop/

  30. Getting Started - SYCL.tech, accessed October 7, 2025, https://sycl.tech/getting-started/