9.1 Parallel Programming Models

The field of parallel programming encompasses a diverse range of models, languages, and libraries, which have evolved over several decades to adapt to the changing landscape of computer architecture [1]. Selecting the appropriate tool for a given task requires understanding the principles that differentiate these models. The distinctions are driven primarily by two fundamental dichotomies: how the underlying hardware organizes memory and how computational work is decomposed [2].

9.1.1 Foundational Dichotomies: Memory and Work Decomposition

The design of a parallel programming model is significantly influenced by the machine’s memory system architecture. This characteristic determines how processing elements communicate and shapes the programmer’s approach to problem-solving.

Shared vs. Distributed Memory Architectures

Parallel computer architectures are broadly classified into two families, shared memory and distributed memory, based on how processors access memory; hybrid designs such as DSM and PGAS blend the two [3].

| Architecture Type | Memory Organization | Communication Method | Key Advantages | Key Challenges | Examples |
|---|---|---|---|---|---|
| Shared Memory | Single global address space accessible by all processors [4] | Implicit, through shared variables [5] | Simpler programming model; implicit communication | Race conditions; synchronization overhead; limited scalability due to memory bus contention [3][6] | Multi-core CPUs, desktops, laptops, servers [7] |
| Distributed Memory | Each processor has private local memory [4] | Explicit message passing [4] | High scalability; adding nodes increases compute power and memory | Complex programming; must manage data partitioning, distribution, and communication [4] | HPC clusters, supercomputers [8] |
| Distributed Shared Memory (DSM) | Illusion of a shared address space over distributed hardware [4] | Appears shared but uses message passing underneath | Simpler than pure distributed memory | Hidden but present communication latency [4] | Research systems, some commercial platforms |
| PGAS (Partitioned Global Address Space) | Logically partitioned global address space with processor affinity [6] | Shared-memory-style reads/writes with locality awareness | Balance of programmability and performance; locality control | Requires careful data placement | UPC, Chapel, X10 languages |

The Fundamental Trade-off: Shared-memory programs focus on synchronization (preventing race conditions through locks, semaphores, and atomic operations), while distributed-memory programs emphasize data movement (managing explicit communication between processors). Migrating an application from one model to the other requires a complete re-evaluation of the algorithm's data structures and communication patterns [4][5][6].

The second fundamental dichotomy concerns how a problem is decomposed into parallel units of work [2].

| Parallelism Type | Definition | Source of Parallelism | Best Suited For | Common Examples | Architectural Alignment |
|---|---|---|---|---|---|
| Data Parallelism | Distribute a large dataset across multiple processing units; apply the same operation to each subset [2] | The data, not the operations | Regularly structured data (arrays, matrices) [6] | Vector processing in simulations; image filters applied to pixels; deep-learning training on batches [9] | SIMD, SPMD [10] |
| Task Parallelism | Distribute distinct tasks across different processors for concurrent execution [9] | Different operations/functions | Irregular or heterogeneous problems; workflows with dependencies [10] | Pipelining; concurrent atmosphere and ocean models in climate simulation | MIMD [2] |

Real-World Hybrid Approach: Few applications are exclusively data-parallel or task-parallel. Most exhibit a combination of both at various program levels. A climate simulation might employ data parallelism to update grid cells within a time step (same operation, different data) and task parallelism to concurrently execute separate atmosphere and ocean models (different tasks) [10].
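To make the distinction concrete, the following sketch expresses both forms with plain C++ threads: a data-parallel phase applies the same scaling operation to disjoint chunks of one array, and a task-parallel phase runs two different functions concurrently. The 4-way split, the scaling operation, and the placeholder update_atmosphere / update_ocean functions are illustrative assumptions, not taken from the text.

```cpp
#include <thread>
#include <vector>
#include <numeric>
#include <functional>
#include <cstdio>

// Data parallelism: the same operation (scaling) applied to disjoint chunks.
void scale_chunk(std::vector<double>& v, std::size_t begin, std::size_t end, double factor) {
    for (std::size_t i = begin; i < end; ++i) v[i] *= factor;
}

// Task parallelism: two different operations run concurrently.
void update_atmosphere() { std::puts("atmosphere step"); }
void update_ocean()      { std::puts("ocean step"); }

int main() {
    std::vector<double> grid(1'000'000, 1.0);

    // Data-parallel phase: split the array across 4 threads.
    const std::size_t n = grid.size(), chunk = n / 4;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t == 3) ? n : begin + chunk;
        workers.emplace_back(scale_chunk, std::ref(grid), begin, end, 2.0);
    }
    for (auto& w : workers) w.join();

    // Task-parallel phase: distinct model components run side by side.
    std::thread atm(update_atmosphere), ocn(update_ocean);
    atm.join(); ocn.join();

    std::printf("sum = %f\n", std::accumulate(grid.begin(), grid.end(), 0.0));
    return 0;
}
```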

9.1.2 Models for Shared-Memory Systems

Programming models for shared-memory systems build on the global address space, primarily offering mechanisms to create concurrent threads of execution and to manage their access to shared data. The development of these models reflects an ongoing effort to raise the level of abstraction, shielding the programmer from the complexities of direct thread management while maintaining high performance.

Directive-Based Parallelism: OpenMP and the Fork-Join Model

Open Multi-Processing (OpenMP) is the primary standard for high-level, directive-based parallel programming on shared-memory systems [11]. OpenMP is a specification, not a library, implemented by compiler vendors (e.g., GCC, Intel C++ Compiler). It enables programmers to incrementally parallelize serial C, C++, and Fortran code using special compiler directives, known as pragmas [12].

The Fork-Join Execution Model: An OpenMP program starts as a single master thread. When encountering a parallel region, it “forks,” creating a team of parallel worker threads. All threads execute the region concurrently. At the end, worker threads “join” back with the master thread, which resumes serial execution until the next parallel region [12].

Key Features of OpenMP:

  1. Loop Parallelization: Adding #pragma omp parallel for before a loop automatically distributes iterations among available threads [12]
  2. Data Scope Control: Clauses manage variable visibility (e.g., private, shared) [11]
  3. Reduction Operations: Built-in support for common reductions (e.g., reduction(+:sum)) [11]
  4. Scheduling Control: Various policies for distributing iterations among threads [11]

OpenMP’s Key Advantage: Incremental parallelization—developers can achieve significant performance improvements on multi-core processors with minimal code changes, making it fundamental to HPC [7][12].
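A minimal sketch of the features listed above, assuming a GCC- or Clang-style compiler invoked with -fopenmp; the harmonic-sum workload is purely illustrative.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1'000'000;
    double sum = 0.0;

    // Fork: the master thread spawns a team; loop iterations are divided among
    // the threads, and each thread's private partial sum is combined at the join.
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; ++i) {
        sum += 1.0 / (i + 1);   // same operation applied to different data
    }

    std::printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Without the pragma (or without OpenMP support), the loop still compiles and runs as a correct serial program, which is what makes the incremental approach possible.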

Explicit Threading: POSIX Threads (Pthreads)

In contrast to OpenMP, POSIX Threads (Pthreads) represents a lower level of abstraction. Pthreads is a low-level C programming API, standardized by the IEEE, that offers direct and explicit control over threads [7]. This library-based approach requires including the <pthread.h> header and linking against the Pthreads library [13]. With Pthreads, the programmer manages all aspects of the thread lifecycle [13]. New threads are created with pthread_create(), which specifies the function the new thread will execute [14], and the main program can await a thread’s completion using pthread_join() [13].

Because all threads within a process share the same memory space, programmers must manage synchronization manually to prevent data races. Pthreads offers a suite of synchronization primitives, the most fundamental being the mutual exclusion lock, or mutex (pthread_mutex_t). A thread must acquire the lock with pthread_mutex_lock() before entering a critical section that accesses shared data and release it with pthread_mutex_unlock() on exit, ensuring exclusive access to the critical section [13].

This model provides extensive control and flexibility, but at the cost of significant programming complexity: developers must identify and handle every potential data race, all synchronization, and the manual partitioning of work among threads. Consequently, while Pthreads is powerful for system-level programming, it is often too cumbersome for general application development, where higher-level models are typically preferred [7].
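The sketch below illustrates the lifecycle just described: pthread_create(), a mutex-protected critical section, and pthread_join(). The shared counter and the choice of four threads are illustrative assumptions; compile with a C++ compiler and link with -lpthread.

```cpp
#include <pthread.h>
#include <cstdio>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

// Each thread runs this function; the signature is fixed by pthread_create().
void* worker(void*) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&counter_lock);     // enter the critical section
        ++counter;                             // exclusive access to shared data
        pthread_mutex_unlock(&counter_lock);   // leave the critical section
    }
    return nullptr;
}

int main() {
    pthread_t threads[4];
    for (auto& t : threads) pthread_create(&t, nullptr, worker, nullptr);
    for (auto& t : threads) pthread_join(t, nullptr);   // wait for completion
    std::printf("counter = %ld\n", counter);            // expect 400000
    return 0;
}
```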

Task-Based Parallelism: Intel Threading Building Blocks (oneTBB)

Intel’s Threading Building Blocks (now oneAPI Threading Building Blocks, or oneTBB) offers a modern C++ approach that balances the high-level simplicity of OpenMP with the explicit control of Pthreads [15]. TBB is a C++ template library that abstracts away raw threads, encouraging programmers to think in terms of tasks [16]. Its philosophy promotes data-parallel programming through high-level parallel algorithms that operate on data collections, in the spirit of the C++ Standard Template Library (STL) [17]. For instance, instead of manually creating threads to parallelize a loop, a developer can call tbb::parallel_for, providing it with a range of indices and a C++ lambda function or function object that defines the work for each iteration [16].

Much of TBB’s efficacy stems from its sophisticated runtime scheduler [18]. In contrast to the often static scheduling of basic OpenMP, TBB uses a dynamic work-stealing scheduler to achieve automatic load balancing [19]. Each thread maintains its own task queue; when a thread becomes idle, it attempts to “steal” a task from the queue of another, busier thread [19]. This mechanism handles applications with irregular or unpredictable task execution times effectively, keeping processor cores busy without manual intervention by the programmer [16]. The combination of high-level C++ abstractions and an intelligent runtime makes TBB a robust tool for developing complex, scalable parallel applications on shared-memory systems.
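A minimal sketch of that style, assuming oneTBB is installed and linked (e.g., with -ltbb); the array-scaling workload is an illustrative assumption.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> data(1'000'000, 1.0);

    // The runtime splits the range into tasks; idle threads steal tasks from
    // busy ones, so load balancing needs no manual intervention.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, data.size()),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0;
                      });

    std::printf("data[0] = %f\n", data[0]);
    return 0;
}
```

The lambda receives whatever sub-range the scheduler hands it, so the same code runs unchanged whether the runtime creates two tasks or two thousand.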

9.1.3 Models for Distributed Memory Systems

When computation extends beyond a single machine to clusters of interconnected nodes, the shared-memory assumption becomes invalid. Programming models for these systems must address the physical reality of distributed memory, prioritizing data locality and explicit communication.

The Message Passing Interface (MPI) is the established standard for programming distributed-memory parallel computers, ranging from small clusters to the largest supercomputers globally [5]. MPI is not a language itself but a specification for a library of functions that can be invoked from C, C++, and Fortran to manage parallel processes and their communication [20].

MPI Fundamental Concepts: An MPI program consists of independent processes, often on distinct physical nodes. Processes are organized into a communicator (default: MPI_COMM_WORLD). Each process has a unique integer rank (starting from 0) within its communicator [20].

Communication in MPI is explicit, relying on messages sent and received between processes identified by their ranks [20].

| Type | Operations | Description | Use Case |
|---|---|---|---|
| Point-to-Point | MPI_Send(), MPI_Recv() | Single sender, single receiver. Sender packages data and transmits to the destination rank; receiver blocks until the message arrives [21] | Direct data exchange between specific processes |
| Broadcast | MPI_Bcast() | One process sends identical data to all other processes [21] | Distributing global parameters or configuration |
| Scatter | MPI_Scatter() | One process distributes distinct portions of an array to all processes [21] | Partitioning data for parallel processing |
| Gather | MPI_Gather() | Inverse of scatter: collects distinct data from all processes to one [21] | Collecting results after parallel computation |
| Reduction | MPI_Reduce() | Combines data from all processes into a single result using a specified operation (e.g., sum, max) [21] | Computing global aggregates such as totals or maximums |

MPI’s Design Philosophy: By requiring explicit management of data placement and inter-process communication, MPI offers maximum control over parallel execution in distributed systems. Though more complex than shared-memory models, its portability and scalability have made it fundamental to high-performance computing [20].
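A minimal sketch of these concepts, assuming an MPI implementation with its compiler wrapper (e.g., mpic++) and launcher (e.g., mpirun -np 4); reducing each process’s rank into a sum on rank 0 is an illustrative choice.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // unique id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes in the communicator

    int local = rank, total = 0;
    // Combine each process's value with a sum; only rank 0 receives the result.
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```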

9.1.4 Models for Heterogeneous and Accelerator-Based Systems

The emergence of specialized hardware accelerators, particularly Graphics Processing Units (GPUs), has added a new dimension to parallel programming. These devices provide significant computational power but have distinct architectures and memory systems, requiring specialized programming models to manage the complex interaction between a host CPU and an accelerator device.

GPU Computing: NVIDIA’s CUDA and the OpenCL Standard

The evolution of GPUs from fixed-function graphics pipelines to fully programmable parallel processors for general-purpose computing (GPGPU) marked a significant turning point in HPC [7].

CUDA (Compute Unified Device Architecture)

Developed by NVIDIA, CUDA is a proprietary parallel computing platform and programming model for its GPUs [22]. It extends C++ with a small set of language extensions and a runtime library.

CUDA’s Heterogeneous Model: Explicitly based on interaction between a host (CPU) and devices (GPUs). Host and device have separate memory spaces (system RAM and GPU VRAM), requiring explicit data transfers via cudaMemcpy(). Parallel computations are written as kernels (designated by __global__) and launched from the host with a special syntax that specifies the number of parallel threads [23][24].

| Component | Role | Key Operations |
|---|---|---|
| Host (CPU) | Orchestrates computation; manages data transfers | Allocate GPU memory, transfer data, launch kernels |
| Device (GPU) | Executes massively parallel computations | Run thousands of threads executing kernel code |
| Kernel | GPU function executed by many threads in parallel | Designated by __global__; launched with <<<...>>> syntax [24] |
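A minimal sketch of this host-device workflow, compiled with nvcc; the vector-add kernel and the launch configuration (256 threads per block) are illustrative assumptions.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                           // device (GPU) memory
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n); // kernel launch from the host

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // copy result back (synchronizes)
    std::printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```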

OpenCL (Open Computing Language)

In response to demand for a vendor-neutral standard, the Khronos Group developed OpenCL [25].

OpenCL’s Portability Promise: Open, royalty-free standard for writing programs that execute on various platforms (CPUs, GPUs, DSPs, FPGAs). Theoretically, an OpenCL program can run on any compliant hardware from any vendor [25][26].

| Feature | CUDA | OpenCL |
|---|---|---|
| Vendor | NVIDIA (proprietary) | Khronos Group (open standard) |
| Hardware Support | NVIDIA GPUs only | CPUs, GPUs, DSPs, FPGAs from multiple vendors [25] |
| Programming Model | Host-device with explicit memory management | Host-device with contexts, command queues, and kernels [26][27] |
| Kernel Language | CUDA C/C++ | OpenCL C (C99-based dialect) [26] |
| Key Advantage | Highly optimized for NVIDIA hardware; mature ecosystem [7] | True cross-platform portability [25] |
| Trade-off | Vendor lock-in | May sacrifice performance for genericity [28] |
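For comparison with CUDA’s __global__ kernels, the following is a sketch of an equivalent kernel written in the OpenCL C dialect; the host-side boilerplate (platform and device discovery, context and command-queue creation, program compilation, and buffer transfers) is omitted here.

```c
// OpenCL C kernel: each work-item handles one element, indexed by its global id.
__kernel void vector_add(__global const float* a,
                         __global const float* b,
                         __global float* c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```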

The complexity of managing separate host and device codebases in CUDA and OpenCL prompted the development of higher-level abstractions. SYCL (pronounced “sickle”) is a royalty-free, cross-platform abstraction layer from the Khronos Group that allows code for heterogeneous processors to be written in a single-source style using standard C++ [29].

SYCL’s Single-Source Innovation: Both host and device code in the same source file, using modern C++ features (classes, templates, lambda functions). The SYCL runtime automatically manages kernel execution and data transfers through buffer and accessor objects, abstracting hardware complexity from the developer [29][30].

SYCL Architecture:

| Component | Function |
|---|---|
| Queue Object | Submits work to compute devices [30] |
| Command Group | Encapsulates a kernel and its data dependencies (defined in a lambda) [30] |
| Buffer & Accessor Objects | Describe data and manage transfers automatically [29] |
| Backend | Operates on top of OpenCL, CUDA, or other compute APIs [29] |

This single-source approach enhances programmability and code maintainability for complex heterogeneous systems, representing a further step in hiding hardware complexity from the developer [29].
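A minimal single-source sketch combining the components in the table above, written against the SYCL 2020 interface (header and namespace names may differ for older implementations); the buffer-scaling workload is an illustrative assumption.

```cpp
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> data(1024, 1.0f);

    sycl::queue q;                                        // submits work to a device
    {
        sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(data.size()));

        q.submit([&](sycl::handler& h) {                  // command group
            sycl::accessor acc(buf, h, sycl::read_write); // describes the data access
            h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
                acc[i] *= 2.0f;                           // device code, same source file
            });
        });
    }   // buffer destructor waits and copies results back to `data`

    std::printf("data[0] = %f\n", data[0]);
    return 0;
}
```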

9.1.5 Comparative Analysis of Programming Models

The choice of a parallel programming model requires balancing trade-offs between performance, programmability, portability, and the target hardware. The evolution from low-level models such as Pthreads, which provide maximum control at the cost of high complexity, to high-level abstractions like SYCL, which prioritize programmer productivity and portability, directly addresses the increasing complexity of parallel hardware [7]. This progression is a necessary adaptation to make parallel programming more accessible. The optimal model is context-dependent, determined by the application’s specific requirements, the available hardware, and development resources. The following table summarizes the key models discussed, outlining their defining characteristics and typical use cases.

| Model | Primary Memory Model | Parallelism Type | Abstraction Level | Primary Use Case | Programmability |
|---|---|---|---|---|---|
| Pthreads | Shared | Task | Low | Fine-grained thread control, systems programming | Low |
| OpenMP | Shared | Data / Task | High | Incremental parallelization of loops on multi-core CPUs | High |
| oneTBB | Shared | Data / Task | High | C++ task-based parallelism with dynamic load balancing | High |
| MPI | Distributed | Data / Task | Low | Large-scale cluster and supercomputer programming | Low |
| CUDA | Host-Device | Data | Medium | High-performance computing on NVIDIA GPUs | Medium |
| OpenCL | Host-Device | Data / Task | Medium | Portable parallel programming for heterogeneous accelerators | Medium |
| SYCL | Host-Device | Data / Task | High | Single-source C++ for portable heterogeneous programming | High |

  1. Models for Parallel Computing : Review and Perspectives - Semantic Scholar, accessed October 7, 2025, https://www.semanticscholar.org/paper/Models-for-Parallel-Computing-%3A-Review-and-Kessler-Keller/c924481fbb05bb807920c8f3f2f4d9234c9f1c29

  2. Insights on Parallel Programming Model - Advanced Millennium Technologies, accessed October 7, 2025, https://blog.amt.in/index.php/2023/01/17/insights-on-parallel-programming-model/

  3. Shared Versus Distributed Memory Multiprocessors - ECMWF, accessed October 7, 2025, https://www.ecmwf.int/sites/default/files/elibrary/1990/10302-shared-versus-distributed-memory-multiprocessors.pdf

  4. Distributed memory - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Distributed_memory

  5. How Parallel Processing Shaped Modern Computing - CelerData, accessed October 7, 2025, https://celerdata.com/glossary/how-parallel-processing-shaped-modern-computing

  6. Parallel programming model - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Parallel_programming_model

  7. The Evolution of Parallel Programming | by Tiwariabhinav | Medium, accessed October 7, 2025, https://medium.com/@tiwariabhinav424/the-evolution-of-parallel-programming-d80665066b88

  8. Parallel Computing at a Glance, accessed October 7, 2025, http://www.buyya.com/microkernel/chap1.pdf

  9. Data Parallel, Task Parallel, and Agent Actor Architectures - bytewax, accessed October 7, 2025, https://bytewax.io/blog/data-parallel-task-parallel-and-agent-actor-architectures

  10. Task parallelism - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Task_parallelism

  11. Using OpenMP with C — Research Computing University of …, accessed October 7, 2025, https://curc.readthedocs.io/en/latest/programming/OpenMP-C.html

  12. An introduction to OpenMP - University College London, accessed October 7, 2025, https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html

  13. Multithreaded Programming (POSIX pthreads Tutorial) - randu.org, accessed October 7, 2025, https://randu.org/tutorials/threads/

  14. C++ Tutorial: Multi-Threaded Programming - C++ Class Thread for Pthreads - 2020 - BogoToBogo, accessed October 7, 2025, https://www.bogotobogo.com/cplusplus/multithreading_pthread.php

  15. Getting Started with Intel® Threading Building Blocks (Intel® TBB), accessed October 7, 2025, https://www.intel.com/content/www/us/en/developer/articles/guide/get-started-with-tbb.html

  16. Introduction to the Intel Threading Building Blocks — mcs572 0.7.8 documentation, accessed October 7, 2025, http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/lec11.html

  17. TBB Tutorial - cs.wisc.edu, accessed October 7, 2025, https://pages.cs.wisc.edu/~gibson/tbbTutorial.html

  18. Introducing Intel Tbb - Agenda INFN, accessed October 7, 2025, https://agenda.infn.it/event/4107/contributions/50346/attachments/35473/41878/Tbb_Introduction.pdf

  19. Threading Building Blocks - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/Threading_Building_Blocks

  20. MPI Tutorial – Part 1, accessed October 7, 2025, https://spcl.inf.ethz.ch/Teaching/2017-dphpc/recitation/mpi1.pdf

  21. Tutorials · MPI Tutorial, accessed October 7, 2025, https://mpitutorial.com/tutorials/

  22. (PDF) Review of Architecture and Model for Parallel Programming - ResearchGate, accessed October 7, 2025, https://www.researchgate.net/publication/372822306_Review_of_Architecture_and_Model_for_Parallel_Programming

  23. CUDA C++ Programming Guide | NVIDIA Docs, accessed October 7, 2025, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

  24. CUDA Programming Guide: An Overview | by Zia Babar | Aug, 2025 - Medium, accessed October 7, 2025, https://medium.com/@zbabar/cuda-programming-guide-an-overview-84be487cb5a8

  25. OpenCL - The Open Standard for Parallel Programming of …, accessed October 7, 2025, https://www.khronos.org/opencl/

  26. OpenCL - Wikipedia, accessed October 7, 2025, https://en.wikipedia.org/wiki/OpenCL

  27. OpenCL execution model - Arm Developer, accessed October 7, 2025, https://developer.arm.com/documentation/dui0538/e/opencl-concepts/opencl-execution-model

  28. An introduction to OpenCL - Purdue Engineering, accessed October 7, 2025, https://engineering.purdue.edu/~smidkiff/ece563/NVidiaGPUTeachingToolkit/Mod20OpenCL/3rd-Edition-AppendixA-intro-to-OpenCL.pdf

  29. Heterogeneous programming with SYCL documentation, accessed October 7, 2025, https://enccs.github.io/sycl-workshop/

  30. Getting Started - SYCL.tech, accessed October 7, 2025, https://sycl.tech/getting-started/