9.1 Survey of Parallel Programming Models and Libraries

The landscape of parallel programming is rich and varied, populated by a multitude of models, languages, and libraries that have evolved over decades to address the ever-changing face of computer architecture.1 Selecting the appropriate tool for a given task requires a clear understanding of this landscape, which is best navigated by first grasping the fundamental principles that differentiate these models. At the highest level, these distinctions are driven by two foundational dichotomies: how the underlying hardware organizes memory and how the computational work is decomposed.2

9.1.1 Foundational Dichotomies: Memory and Work Decomposition

The most profound influence on the design of a parallel programming model is the architecture of the machine’s memory system. This single characteristic dictates how processing elements communicate and, consequently, imposes a cognitive model on the programmer that is difficult, if not impossible, to escape.

Shared vs. Distributed Memory Architectures

Parallel computer architectures are broadly classified into two families based on how processors access memory.3

  • Shared Memory: In a shared-memory architecture, all processors or cores are connected to a single, global address space.4 Any processor can read from or write to any memory location. This model is conceptually simple for the programmer; communication occurs implicitly by reading and writing to shared variables.5 Modern multi-core CPUs found in desktops, laptops, and servers are quintessential examples of shared-memory systems.6 However, this simplicity belies significant challenges. The primary difficulty is managing concurrent access to shared data to prevent race conditions, where the final outcome of a computation depends on the unpredictable timing of operations by different threads. To ensure correctness, programmers must use synchronization mechanisms like locks, semaphores, or atomic operations to protect critical sections of code.7 Furthermore, as the number of processors increases, the single shared memory bus can become a performance bottleneck, limiting the scalability of the system.3
  • Distributed Memory: In a distributed-memory architecture, each processor has its own private, local memory that other processors cannot directly access.4 These systems consist of multiple independent nodes (each with its own processor and memory) connected by a network.5 Communication is achieved explicitly through message passing, where one processor sends a message containing data to another processor, which must explicitly receive it.4 This model is more complex to program, as the developer is responsible for all data partitioning, distribution, and communication.4 However, it is highly scalable; by adding more nodes, one can increase both computational power and aggregate memory capacity. This is the dominant architecture for large-scale systems, including high-performance computing (HPC) clusters and supercomputers.8
  • Hybrid Models: To bridge the gap between these two extremes, hybrid models have been developed. Distributed Shared Memory (DSM) systems provide the illusion of a single shared address space on top of physically distributed memory, hiding the underlying message passing from the programmer.4 While this simplifies programming, it does not eliminate the performance penalty of remote memory access; the latency of communication is hidden, but not removed.4 Another approach is the Partitioned Global Address Space (PGAS) model, which exposes a global address space but logically partitions it, with each portion having an “affinity” to a specific processor. This allows the programmer to reason about data locality while still using convenient shared-memory-style reads and writes.7

The choice between a shared or distributed memory model is not merely an implementation detail; it fundamentally shapes the entire programming paradigm. A shared-memory program is primarily concerned with synchronization, while a distributed-memory program is primarily concerned with data movement. Porting an application from one to the other requires a complete rethinking of the algorithm’s data structures and communication patterns.
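
To make the synchronization concern concrete, the following minimal sketch uses plain standard C++ threads (none of the libraries discussed later in this chapter); the shared counter and the iteration count are purely illustrative. Two threads increment the same variable: without the mutex the increments would race and the final value would be unpredictable, while with it each increment executes inside a protected critical section.

```cpp
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long counter = 0;   // shared state: both threads read and write it
    std::mutex m;       // guards the critical section below

    auto work = [&]() {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // without this, ++counter is a data race
            ++counter;
        }
    };

    std::thread t1(work), t2(work);  // two concurrent writers to the same location
    t1.join();
    t2.join();

    // 2000000 with the lock; an unpredictable smaller value if the lock is removed.
    std::cout << "counter = " << counter << '\n';
    return 0;
}
```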

The second fundamental dichotomy concerns how a problem is decomposed into parallel units of work.2

  • Data Parallelism: This model focuses on distributing a large dataset across multiple processing units and applying the same operation to each subset of the data concurrently.2 The parallelism stems from the data, not the operations. This is an extremely common and highly scalable pattern, particularly well-suited for regularly structured data like arrays and matrices.7 Classic examples include vector processing in scientific simulations, image processing where the same filter is applied to every pixel, and training deep learning models on large batches of data.9 Architecturally, this maps well to Single Instruction, Multiple Data (SIMD) or Single Program, Multiple Data (SPMD) execution models.10
  • Task Parallelism: This model, also known as function or control parallelism, focuses on distributing different tasks (functions, code blocks) across different processors for concurrent execution.9 The tasks may operate on the same data or on different data, and they can be entirely independent or part of a complex workflow with dependencies.10 A common example is pipelining, where a stream of data passes through a series of stages, with each stage performing a different task.10 Task parallelism is more flexible for irregular or heterogeneous problems but can present significant challenges in load balancing, as tasks may have widely varying execution times.9 This model corresponds to the Multiple Instruction, Multiple Data (MIMD) architectural classification.2

In practice, few real-world applications are purely data-parallel or purely task-parallel. Most exist on a continuum, often exhibiting both forms of parallelism at different levels of the program’s structure.10 For instance, a climate simulation might use data parallelism to update the state of grid cells within a time step (same operation, different data) while using task parallelism to concurrently run separate models for the atmosphere and the ocean (different tasks).
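
To make the distinction concrete, the sketch below uses only standard C++ (std::thread and std::async); the array, its size, and the two stand-in “model” functions are illustrative assumptions, not drawn from any particular application. The first step applies the same operation to different halves of one array (data parallelism); the second runs two unrelated functions concurrently (task parallelism).

```cpp
#include <future>
#include <thread>
#include <vector>

// Data parallelism: the same operation applied to different portions of one array.
void scale_range(std::vector<double>& v, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) v[i] *= 2.0;
}

// Task parallelism: two different computations that can proceed independently.
double simulate_atmosphere() { return 1.0; }  // stand-in for one model component
double simulate_ocean()      { return 2.0; }  // stand-in for another

int main() {
    std::vector<double> grid(1000000, 1.0);

    // Data-parallel step: same code, different data.
    std::thread lower(scale_range, std::ref(grid), std::size_t{0}, grid.size() / 2);
    std::thread upper(scale_range, std::ref(grid), grid.size() / 2, grid.size());
    lower.join();
    upper.join();

    // Task-parallel step: different code, run concurrently.
    auto atmosphere = std::async(std::launch::async, simulate_atmosphere);
    auto ocean      = std::async(std::launch::async, simulate_ocean);
    double combined = atmosphere.get() + ocean.get();

    return combined > 0.0 ? 0 : 1;  // keep the results observable
}
```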

9.1.2 Models for Shared-Memory Systems

Programming models for shared-memory systems leverage the convenience of a global address space, focusing primarily on providing mechanisms to create concurrent threads of execution and manage their access to shared data. The evolution of these models reflects a continuous effort to raise the level of abstraction, shielding the programmer from the complexities of direct thread management while still enabling high performance.

Directive-Based Parallelism: OpenMP and the Fork-Join Model

Open Multi-Processing (OpenMP) is the dominant standard for high-level, directive-based parallel programming on shared-memory systems.11 Rather than being a library, OpenMP is a specification implemented by compiler vendors (such as GCC and the Intel compilers) that allows a programmer to incrementally parallelize serial C, C++, and Fortran code using special preprocessor directives, or pragmas.12

The core of OpenMP is its fork-join execution model.12 An OpenMP program begins execution with a single thread, known as the master thread. When the master thread encounters a parallel region (demarcated by an OpenMP directive), it “forks,” creating a team of parallel worker threads. The code within this region is then executed in parallel by all threads in the team. At the end of the parallel region, the worker threads “join” back with the master thread, which then continues with serial execution until the next parallel region is encountered.12

The most common use of OpenMP is loop parallelization. By adding a single line of code, #pragma omp parallel for, before a for loop, the programmer instructs the compiler to automatically divide the loop’s iterations among the available threads.12 This simplicity is OpenMP’s greatest strength; it allows developers to achieve significant performance gains on multi-core processors with minimal code modification, making it a cornerstone of HPC.6 The model also provides a rich set of clauses to manage data scope (private, shared), perform reductions (reduction(+:sum)), and control scheduling.11
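
As a concrete illustration of these directives and clauses, the fragment below parallelizes a dot product with #pragma omp parallel for and a reduction(+:sum) clause. It is a minimal sketch, not taken from any cited tutorial; the vector sizes are arbitrary, and it assumes an OpenMP-capable compiler (e.g. g++ -fopenmp).

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // Fork: the master thread creates a team and the loop iterations are divided among them.
    // Each thread accumulates a private partial sum; the reduction clause combines them at the join.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }

    // Join: serial execution resumes on the master thread.
    std::printf("dot product = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```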

Explicit Threading: POSIX Threads (Pthreads)

At the opposite end of the abstraction spectrum from OpenMP lies POSIX Threads, or Pthreads. Pthreads is not a high-level model but a low-level C programming API, standardized by the IEEE, that provides direct and explicit control over threads.6 It is a library-based approach, requiring the inclusion of the <pthread.h> header and linking against the Pthreads library.13

With Pthreads, the programmer is responsible for every aspect of the thread lifecycle.13 A new thread is created with the pthread_create() function, which specifies the function the new thread will execute.14 The main program can wait for a thread to complete its execution using pthread_join().13 Because all threads within a process share the same memory space, the programmer must manually manage synchronization to prevent data races. Pthreads provides a suite of synchronization primitives for this purpose, the most fundamental of which is the mutual exclusion lock, or mutex (pthread_mutex_t). A thread must acquire a lock using pthread_mutex_lock() before entering a critical section that accesses shared data and release it with pthread_mutex_unlock() upon exit, ensuring that only one thread can be in the critical section at a time.13

This model offers the ultimate in control and flexibility but comes at the cost of significant programming complexity. The developer must reason about every potential data race, manage all synchronization, and manually partition the work among threads. This makes Pthreads powerful for system-level programming but often too cumbersome for general application development, where higher-level models are preferred.6
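
The sketch below walks through that lifecycle: two threads are started with pthread_create(), each increments a shared counter inside a mutex-protected critical section, and the main thread waits with pthread_join(). The counter and iteration count are illustrative; the program assumes a POSIX system and linking with -pthread.

```cpp
#include <cstdio>
#include <pthread.h>

long counter = 0;                                  // shared data
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // protects counter

void* worker(void* /*unused*/) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);    // enter the critical section
        ++counter;                    // only one thread at a time executes this
        pthread_mutex_unlock(&lock);  // leave the critical section
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);  // start two worker threads
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);                      // wait for both to finish
    pthread_join(t2, nullptr);
    std::printf("counter = %ld\n", counter);        // 200000: the mutex prevents lost updates
    return 0;
}
```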

Task-Based Parallelism: Intel Threading Building Blocks (oneTBB)

Intel’s Threading Building Blocks (now oneAPI Threading Building Blocks, or oneTBB) offers a modern C++ approach that strikes a balance between the high-level simplicity of OpenMP and the explicit control of Pthreads.15 TBB is a C++ template library that abstracts away raw threads and instead encourages the programmer to think in terms of tasks.16 TBB’s philosophy is to enable data-parallel programming through high-level parallel algorithms that operate on data collections, much like the C++ Standard Template Library (STL).17 For example, instead of manually creating threads to parallelize a loop, a developer can use TBB’s tbb::parallel_for algorithm, passing it a range of indices and a C++ lambda function or function object that defines the work for each iteration.16

The true power of TBB lies in its sophisticated runtime scheduler.18 Unlike the often static scheduling of basic OpenMP, TBB employs a dynamic work-stealing scheduler for automatic load balancing.19 In this model, each thread maintains its own queue of tasks. When a thread runs out of work, it becomes a “thief” and attempts to “steal” a task from the queue of another, busy “victim” thread.19 This mechanism is highly effective for applications with irregular or unpredictable task execution times, as it keeps all processor cores busy without requiring manual intervention from the programmer.16 This combination of high-level C++ abstractions and an intelligent runtime makes TBB a powerful tool for developing complex, scalable parallel applications on shared-memory systems.
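
A minimal sketch of this style is shown below: tbb::parallel_for is handed a blocked_range and a lambda, and the runtime decides how to split and schedule the chunks. The data and the per-element operation are illustrative; the program assumes oneTBB is installed and linked (e.g. -ltbb).

```cpp
#include <cstdio>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

int main() {
    std::vector<float> data(1000000, 1.0f);

    // The scheduler recursively splits the range into chunks and runs them as tasks;
    // idle threads steal chunks from busy ones, balancing the load automatically.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;  // the per-iteration work
        });

    std::printf("data[0] = %f\n", data[0]);
    return 0;
}
```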

9.1.3 Models for Distributed Memory Systems

When computation scales beyond a single machine to clusters of interconnected nodes, the shared-memory assumption breaks down. Programming models for these systems must confront the physical reality of distributed memory, making data locality and explicit communication first-class concerns.

The Message Passing Interface (MPI) is the undisputed, de facto standard for programming distributed-memory parallel computers, from small clusters to the world’s largest supercomputers.5 MPI is not a language but a specification for a library of functions that can be called from C, C++, and Fortran to manage parallel processes and their communication.20 The fundamental concepts of MPI are straightforward.20 An MPI program is launched as a collection of independent processes, often running on different physical nodes. These processes are organized into a communication group, or communicator, the default being MPI_COMM_WORLD, which includes all processes. Within a communicator, each process is assigned a unique integer identifier called its rank, starting from 0.20 Communication is explicit and based on sending and receiving messages between processes, identified by their ranks.20 MPI defines a rich set of communication operations, which fall into two main categories:

  1. Point-to-Point Communication: These operations involve a single sender and a single receiver. The most basic are blocking sends and receives, MPI_Send() and MPI_Recv(). When a process calls MPI_Send(), it packages data into a message and sends it to a destination rank. The MPI_Recv() call waits until a message from a source rank arrives and is placed into a specified buffer.21
  2. Collective Communication: These operations involve all processes within a communicator and are used to implement common parallel patterns efficiently. Examples include MPI_Bcast() (broadcast), where one process sends the same data to all other processes; MPI_Scatter(), where one process distributes different chunks of an array to all other processes; MPI_Gather(), the inverse of scatter; and MPI_Reduce(), which combines data from all processes into a single result (e.g., finding a global sum) using a specified operation.21

By forcing the programmer to manage all data placement and inter-process communication explicitly, MPI provides maximum control over the parallel execution on a distributed system. While this makes it more complex than shared-memory models, its portability and scalability have made it an enduring and essential tool in the field of high-performance computing.20
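
To tie these concepts together, the sketch below has every rank contribute a value that MPI_Reduce() sums onto rank 0. It is a minimal illustration rather than a template from the cited tutorials; it assumes an MPI implementation is available and that the program is built and launched with the usual wrappers (e.g. mpicxx and mpirun).

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                    // start the MPI runtime

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's id within the communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size);      // total number of processes

    // Collective communication: every rank contributes one value,
    // and MPI_SUM combines them into a single result on rank 0.
    int partial = rank + 1;
    int total = 0;
    MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("sum of 1..%d = %d\n", size, total);
    }

    MPI_Finalize();                            // shut down the MPI runtime
    return 0;
}
```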

9.1.4 Models for Heterogeneous and Accelerator-Based Systems

The rise of specialized hardware accelerators, most notably Graphics Processing Units (GPUs), has introduced a new dimension to parallel programming. These devices offer immense computational power but feature distinct architectures and memory systems, necessitating specialized programming models that can manage the complex interplay between a traditional host CPU and a powerful accelerator device.

GPU Computing: NVIDIA’s CUDA and the OpenCL Standard

The transformation of GPUs from fixed-function graphics pipelines to fully programmable parallel processors for general-purpose computing (GPGPU) was a watershed moment in HPC.6

  • CUDA (Compute Unified Device Architecture): Developed by NVIDIA, CUDA is a proprietary parallel computing platform and programming model for its GPUs.22 It extends C++ with a set of language extensions and a runtime library. The CUDA programming model is explicitly heterogeneous, based on the interaction between a host (the CPU) and one or more devices (the GPUs).23 The host and device have separate memory spaces (system RAM and GPU VRAM, respectively), and the programmer is responsible for explicitly managing data transfers between them using functions like cudaMemcpy().24 The parallel computations to be run on the GPU are written as special functions called kernels, which are marked with the __global__ specifier. The host launches a kernel onto the device using a special syntax (kernel_name<<<…>>>), specifying the number of parallel threads to execute.24 This model gives the programmer fine-grained control over the massive parallelism of the GPU, making CUDA a dominant force in fields like deep learning and scientific simulation.6 A minimal host-device round trip in this style is sketched after this list.
  • OpenCL (Open Computing Language): In response to the need for a vendor-neutral standard for heterogeneous computing, the Khronos Group developed OpenCL.25 OpenCL is an open, royalty-free standard for writing programs that execute across diverse platforms, including CPUs, GPUs, Digital Signal Processors (DSPs), and Field-Programmable Gate Arrays (FPGAs).25 Like CUDA, OpenCL employs a host-device model with explicit memory management and kernel execution.26 A host program defines a context, manages command queues, and submits kernels for execution on compute devices.27 The kernels themselves are typically written in a C99-based language dialect called OpenCL C.26 The primary advantage of OpenCL is its portability; in theory, an OpenCL program can run on any compliant hardware from any vendor.25 However, this portability often comes at the cost of performance, as code must be more generic and cannot always take advantage of vendor-specific hardware features as effectively as a proprietary model like CUDA.28
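
The sketch below shows the host-device round trip referenced in the CUDA bullet above: allocate device memory, copy input with cudaMemcpy(), launch a __global__ kernel with the <<<…>>> syntax, and copy the result back. The kernel, array size, and launch configuration are illustrative choices, not taken from the cited guides.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each GPU thread scales one element of the array.
__global__ void scale(float* x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_x = new float[n];                  // host buffer (system RAM)
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float* d_x = nullptr;                       // device buffer (GPU memory)
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host -> device

    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, n, 2.0f);      // launch the kernel on the device

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // device -> host
    std::printf("h_x[0] = %f\n", h_x[0]);                  // expect 2.0

    cudaFree(d_x);
    delete[] h_x;
    return 0;
}
```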

The complexity of managing separate host and device codebases in CUDA and OpenCL led to the development of higher-level abstractions. SYCL (pronounced “sickle”) is a royalty-free, cross-platform abstraction layer from the Khronos Group that enables code for heterogeneous processors to be written in a single-source style using standard C++.29 SYCL is not a standalone language but a C++ template library built on top of a backend like OpenCL, CUDA, or others.29 Its key innovation is allowing both host and device code to be written in the same source file using modern C++ features like classes, templates, and lambda functions.29 The programmer submits work to a device via a queue object. The work itself, including the kernel and its data dependencies, is encapsulated within a command group defined inside a lambda function.30 The SYCL runtime automatically manages the execution of the kernel on the target device and handles the necessary data transfers, which are described using buffer and accessor objects.29 This single-source approach significantly improves programmability and code maintainability for complex heterogeneous systems, representing the latest step in the ongoing evolution to abstract away hardware complexity from the developer.29
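
The single-source style can be sketched as follows. This is written against the SYCL 2020 interface and should be read as illustrative rather than tied to any particular implementation (DPC++, AdaptiveCpp, and others provide it); the data and the kernel are arbitrary. A queue selects a device, a buffer wraps the host data, and the command group submitted to the queue contains both the accessor (the data dependency) and the parallel_for kernel.

```cpp
#include <cstdio>
#include <vector>
#include <sycl/sycl.hpp>

int main() {
    std::vector<float> data(1024, 1.0f);

    sycl::queue q;  // default device selection: a GPU if available, otherwise another device

    {
        // The buffer takes ownership of the host data for the duration of its scope.
        sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(data.size()));

        // A command group: the kernel plus its data requirement, expressed as an accessor.
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
                acc[i] *= 2.0f;  // device code in the same source file as the host code
            });
        });
    }  // buffer destruction waits for the kernel and copies the data back to the host

    std::printf("data[0] = %f\n", data[0]);  // expect 2.0
    return 0;
}
```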

9.1.5 Comparative Analysis of Programming Models

The choice of a parallel programming model involves a complex series of trade-offs between performance, programmability, portability, and the nature of the target hardware. The evolution from low-level models like Pthreads, which offer maximum control at the cost of extreme complexity, to high-level abstractions like SYCL, which prioritize programmer productivity and portability, is a direct response to the escalating complexity of parallel hardware.6 This progression is not merely a matter of convenience; it is a necessary adaptation to make parallel programming accessible beyond a small niche of HPC experts. The “best” model is therefore highly context-dependent, defined by the specific requirements of the application, the available hardware, and the development resources. The following table provides a comparative summary of the key models discussed, highlighting their defining characteristics and typical use cases.

Model | Primary Memory Model | Parallelism Type | Abstraction Level | Primary Use Case | Programmability
----- | -------------------- | ---------------- | ----------------- | ----------------- | ---------------
Pthreads | Shared | Task | Low | Fine-grained thread control, systems programming | Low
OpenMP | Shared | Data / Task | High | Incremental parallelization of loops on multi-core CPUs | High
oneTBB | Shared | Data / Task | High | C++ task-based parallelism with dynamic load balancing | High
MPI | Distributed | Data / Task | Low | Large-scale cluster and supercomputer programming | Low
CUDA | Host-Device | Data | Medium | High-performance computing on NVIDIA GPUs | Medium
OpenCL | Host-Device | Data / Task | Medium | Portable parallel programming for heterogeneous accelerators | Medium
SYCL | Host-Device | Data / Task | High | Single-source C++ for portable heterogeneous programming | High

  1. Models for Parallel Computing: Review and Perspectives - Semantic Scholar, accessed October 7, 2025, https://www.semanticscholar.org/paper/Models-for-Parallel-Computing-%3A-Review-and-Kessler-Keller/c924481fbb05bb807920c8f3f2f4d9234c9f1c29

  2. Insights on Parallel Programming Model - Advanced Millennium Technologies, accessed October 7, 2025, https://blog.amt.in/index.php/2023/01/17/insights-on-parallel-programming-model/

  3. Shared Versus Distributed Memory Multiprocessors - ECMWF, accessed October 7, 2025, https://www.ecmwf.int/sites/default/files/elibrary/1990/10302-shared-versus-distributed-memory-multiprocessors.pdf

  4. Distributed memory - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Distributed_memory

  5. How Parallel Processing Shaped Modern Computing - CelerData, accessed October 7, 2025, https://celerdata.com/glossary/how-parallel-processing-shaped-modern-computing

  6. The Evolution of Parallel Programming | by Tiwariabhinav | Medium, accessed October 7, 2025, https://medium.com/@tiwariabhinav424/the-evolution-of-parallel-programming-d80665066b88

  7. Parallel programming model - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Parallel_programming_model

  8. Parallel Computing at a Glance, accessed October 7, 2025, http://www.buyya.com/microkernel/chap1.pdf

  9. Data Parallel, Task Parallel, and Agent Actor Architectures - bytewax, accessed October 7, 2025, https://bytewax.io/blog/data-parallel-task-parallel-and-agent-actor-architectures

  10. Task parallelism - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Task_parallelism

  11. Using OpenMP with C — Research Computing University of …, accessed October 7, 2025, https://curc.readthedocs.io/en/latest/programming/OpenMP-C.html

  12. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html

  13. Multithreaded Programming (POSIX pthreads Tutorial) - randu.org, accessed October 7, 2025, https://randu.org/tutorials/threads/

  14. C++ Tutorial: Multi-Threaded Programming - C++ Class Thread for Pthreads - 2020 - BogoToBogo, accessed October 7, 2025, https://www.bogotobogo.com/cplusplus/multithreading_pthread.php

  15. Getting Started with Intel® Threading Building Blocks (Intel® TBB), accessed October 7, 2025, https://www.intel.com/content/www/us/en/developer/articles/guide/get-started-with-tbb.html

  16. Introduction to the Intel Threading Building Blocks — mcs572 0.7.8 documentation, accessed October 7, 2025, http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/lec11.html

  17. TBB Tutorial - cs.wisc.edu, accessed October 7, 2025, https://pages.cs.wisc.edu/~gibson/tbbTutorial.html

  18. Introducing Intel Tbb - Agenda INFN, accessed October 7, 2025, https://agenda.infn.it/event/4107/contributions/50346/attachments/35473/41878/Tbb_Introduction.pdf

  19. Threading Building Blocks - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Threading_Building_Blocks

  20. MPI Tutorial – Part 1, accessed October 7, 2025, https://spcl.inf.ethz.ch/Teaching/2017-dphpc/recitation/mpi1.pdf

  21. Tutorials · MPI Tutorial, accessed October 7, 2025, https://mpitutorial.com/tutorials/

  22. (PDF) Review of Architecture and Model for Parallel Programming - ResearchGate, accessed October 7, 2025, https://www.researchgate.net/publication/372822306_Review_of_Architecture_and_Model_for_Parallel_Programming

  23. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

  24. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8

  25. OpenCL - The Open Standard for Parallel Programming of …, accessed October 7, 2025, https://www.khronos.org/opencl/

  26. OpenCL - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/OpenCL

  27. OpenCL execution model - Arm Developer, accessed October 7, 2025, https://developer.arm.com/documentation/dui0538/e/opencl-concepts/opencl-execution-model

  28. An introduction to OpenCL - Purdue Engineering, accessed October 7, 2025, https://engineering.purdue.edu/~smidkiff/ece563/NVidiaGPUTeachingToolkit/Mod20OpenCL/3rd-Edition-AppendixA-intro-to-OpenCL.pdf

  29. Heterogeneous programming with SYCL documentation, accessed October 7, 2025, https://enccs.github.io/sycl-workshop/

  30. Getting Started - SYCL.tech, accessed October 7, 2025, https://sycl.tech/getting-started/