8.1 Abstract Models of Parallel Computation
To reason about the efficiency and correctness of parallel algorithms, designers rely on abstract models of computation. These models provide a simplified, formal framework that captures the essential features of a parallel machine while omitting low-level architectural details. This abstraction is crucial; it allows for the development of algorithms that are not tied to a specific piece of hardware, enabling analysis that can predict performance and reveal fundamental computational limits.3 A model of parallel computation typically consists of a programming model, which defines the basic operations and their effects, and a corresponding cost model, which associates a cost (e.g., time) with those operations.3 The evolution of these models reflects the historical progression of the high-performance computing field itself, shifting focus from pure concurrency to the increasingly critical bottlenecks of communication and synchronization.
The Parallel Random Access Machine (PRAM): An Idealized Foundation
The Parallel Random Access Machine (PRAM) was one of the earliest and most influential models for parallel computation, proposed as a natural extension of the sequential Random Access Machine (RAM) model.6 Its purpose was to provide a simple, clean theoretical framework for exploring the logical structure of parallel computation, deliberately divorced from the complexities of inter-processor communication and synchronization.8
Definition and Core Principles
A PRAM consists of an unbounded number of identical processors and a single, global shared memory to which all processors have uniform access.5 The key assumptions of the PRAM model are 7:
- Unbounded Processors: There is no limit to the number of processors that can be applied to a problem.
- Shared Global Memory: All processors communicate and coordinate through a common memory space.
- Synchronous Execution: All processors operate in lock-step, driven by a common clock. In each step, a processor can read from a memory location, perform a local computation, and write to a memory location.
- Unit-Time Memory Access: Any memory access, whether a read or a write, from any processor to any memory location, completes in a single time unit.
This idealized environment allows algorithm designers to focus exclusively on concurrency—that is, identifying which parts of a computation can be performed simultaneously.6
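The lock-step execution style can be made concrete with a small simulation. The sketch below (illustrative only; the function name and structure are invented, not from the text) mimics an EREW PRAM parallel reduction: in each synchronous step, every active "processor" combines two cells of shared memory, and no two processors touch the same cell in the same step, so all accesses are exclusive. The sum of n values completes in ceil(log2 n) PRAM steps.

```python
# Illustrative sketch (not from the text): simulating an EREW PRAM
# parallel reduction. In each synchronous step, "processor" i combines
# shared[i] and shared[i + stride]; because the index pairs are disjoint
# within a step, the access pattern is exclusive-read, exclusive-write.

def pram_sum(values):
    shared = list(values)            # the global shared memory
    n = len(shared)
    stride = 1
    steps = 0
    while stride < n:
        # One synchronous step: all active processors act "at once".
        for i in range(0, n - stride, 2 * stride):
            shared[i] += shared[i + stride]
        stride *= 2
        steps += 1
    return shared[0], steps          # result and number of PRAM steps

total, steps = pram_sum(range(1, 9))  # 1 + 2 + ... + 8
print(total, steps)                   # 36 in log2(8) = 3 steps
```

The inner `for` loop is sequential in this simulation, but under the PRAM model every iteration of it would execute simultaneously in a single unit-time step, which is exactly the concurrency the model is designed to expose.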
PRAM Variants (Handling Memory Access Conflicts)
The synchronous, unit-time access of the PRAM model introduces the possibility of memory access conflicts, where multiple processors attempt to read from or write to the same memory location in the same clock cycle. The PRAM model is therefore classified into several variants based on how it resolves these conflicts.6
- Exclusive Read, Exclusive Write (EREW): This is the most restrictive and weakest PRAM variant. It forbids any simultaneous access to a single memory location. An algorithm for an EREW PRAM must explicitly ensure that no two processors ever attempt to read from or write to the same address at the same time.6
- Concurrent Read, Exclusive Write (CREW): This model relaxes the read constraint, allowing any number of processors to read from the same memory location simultaneously. However, write access remains exclusive; only one processor can write to a given location at a time.6
- Concurrent Read, Concurrent Write (CRCW): This is the most powerful PRAM variant, allowing simultaneous reads and writes to the same memory location. Because a concurrent write introduces ambiguity about which value should be stored, the CRCW model is further divided into sub-variants that define a write-conflict resolution protocol 6:
- Common CRCW: All processors attempting to write to the same location must be writing the exact same value. If they attempt to write different values, the behavior is illegal.
- Arbitrary CRCW: If multiple processors write to the same location, one is chosen arbitrarily, and its value is written. The algorithm cannot make any assumptions about which processor will “win.”
- Priority CRCW: Each processor is assigned a unique priority (often based on its processor ID), and the processor with the highest priority succeeds in writing its value.
It can be shown that a weaker PRAM variant can simulate a more powerful one at a modest cost. For instance, any algorithm for a CRCW PRAM can be simulated on an EREW PRAM with the same number of processors, at the cost of an O(log P) slowdown, where P is the number of processors.8
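The three CRCW write-conflict protocols can be expressed as small resolver functions. The sketch below is a hypothetical illustration (the function names and the `(processor_id, value)` request format are invented): each resolver takes the set of write requests aimed at one memory cell in one step and decides which value the cell ends up holding.

```python
# Hypothetical sketch of the three CRCW write-conflict protocols.
# Each request is a (processor_id, value) pair aimed at a single cell.

def resolve_common(requests):
    # Common CRCW: all writers must agree on the value, else illegal.
    values = {v for _, v in requests}
    if len(values) > 1:
        raise ValueError("illegal: processors wrote different values")
    return values.pop()

def resolve_arbitrary(requests):
    # Arbitrary CRCW: any one writer may win; this sketch takes the
    # first, but an algorithm must not rely on that choice.
    return requests[0][1]

def resolve_priority(requests):
    # Priority CRCW: here the lowest processor ID is treated as the
    # highest priority (one common convention, assumed for this sketch).
    return min(requests, key=lambda r: r[0])[1]

reqs = [(3, 7), (1, 9), (2, 5)]
print(resolve_priority(reqs))    # 9: processor 1 outranks 2 and 3
print(resolve_common([(0, 4), (1, 4)]))  # 4: all writers agree
```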
Theoretical Significance and Limitations
The PRAM model’s primary significance is theoretical. It serves as a benchmark for determining the maximum possible parallelism inherent in a problem, ignoring all physical constraints.7 By analyzing algorithms in the PRAM model, researchers can classify problems based on their parallel complexity. For example, the class NC (Nick’s Class) includes problems that can be solved on a PRAM in polylogarithmic time (O(log^k n)) using a polynomial number of processors (n^O(1)).8 This class represents problems that are considered highly parallelizable.
However, the PRAM is an idealized model that is not physically realizable.8 Its assumption of unit-time access to a global memory for any number of processors is its greatest strength for theoretical analysis and its fatal flaw for practical application. To connect P processors to M memory locations with constant-time access would require a crossbar switch of complexity O(PM), which is prohibitively expensive and does not scale.10 PRAM algorithms, by ignoring communication costs, often encourage the development of “overly fine-grained” solutions that perform an excessive number of communication steps. In practice, these algorithms perform poorly because communication is far more expensive than computation.14
The Bulk Synchronous Parallel (BSP) Model: A Bridging Approach
The practical failures of the PRAM model led to the development of frameworks that explicitly account for the costs of communication and synchronization. The Bulk Synchronous Parallel (BSP) model, proposed by Leslie Valiant, was designed as a “bridging model” to sit between the pure abstraction of PRAM and the complex details of specific hardware architectures.15
Components and the Superstep
A BSP computer is an abstract machine consisting of three main components 15:
- A set of processor-memory pairs.
- A communication network that delivers messages between processors.
- A hardware facility for efficient barrier synchronization of all processors.
Computation in the BSP model proceeds in a sequence of supersteps. Each superstep is composed of three ordered, sequential phases 15:
- Local Computation: Every processor performs computations independently, using only the data stored in its local memory.
- Communication: Processors exchange data by sending and receiving messages. The model treats all communication actions within a superstep as a single logical unit.
- Barrier Synchronization: A global barrier ensures that all processors have finished their computation and communication phases. Once all processors have reached the barrier, all messages sent during the superstep are guaranteed to have arrived at their destinations, and the next superstep can begin.
This superstep structure enforces a disciplined, bulk-synchronous style of programming, which avoids the potential for deadlock and simplifies reasoning about the program’s state.15
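The superstep discipline described above can be sketched as a toy simulator. This is an illustrative sketch only (the `bsp_run` driver and the scatter/gather steps are invented for the example): each processor computes on its local data and posts messages, and the barrier at the end of the superstep is the only point at which messages are delivered.

```python
# Illustrative sketch of the BSP superstep discipline. Messages posted
# during a superstep become visible only after the barrier, never earlier.

def bsp_run(local_data, superstep_fns):
    p = len(local_data)
    inbox = [[] for _ in range(p)]
    for step_fn in superstep_fns:
        outbox = [[] for _ in range(p)]
        # Phases 1-2: local computation plus recorded message sends.
        for pid in range(p):
            local_data[pid], sends = step_fn(pid, local_data[pid], inbox[pid])
            for dest, msg in sends:
                outbox[dest].append(msg)
        # Phase 3: barrier -- all pending messages are delivered here.
        inbox = outbox
    return local_data

# Superstep 1: every processor sends its value to processor 0.
def scatter(pid, local, inbox):
    return local, [(0, local)]

# Superstep 2: processor 0 sums what arrived at the barrier.
def gather(pid, local, inbox):
    return (sum(inbox) if pid == 0 else local), []

print(bsp_run([1, 2, 3, 4], [scatter, gather])[0])  # 10
```

Note that `gather` cannot run in the same superstep as `scatter`: the values it sums only become visible after the barrier, which is precisely the deadlock-free, easy-to-reason-about structure the model enforces.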
Cost Analysis
The BSP model introduces a simple yet powerful cost model to analyze algorithm performance. The total cost of a BSP algorithm is the sum of the costs of its supersteps. The cost of a single superstep is determined by three parameters 15:
- w: The maximum number of floating-point operations (flops) performed by any single processor during the local computation phase.
- g: A parameter representing the time per word to deliver data under continuous traffic, effectively modeling the inverse of the communication network’s bandwidth.
- l: A parameter representing the latency of the barrier synchronization, which includes both network latency and the overhead of ensuring all processors have reached the barrier.
A superstep in which the maximum number of words sent or received by any processor is h is called an h-relation. The cost for this superstep is then calculated as:
Cost_superstep = w + h·g + l
The goal of a BSP algorithm designer is to structure the computation to balance the work (w) and communication (h) across processors and, crucially, to minimize the total number of supersteps, as each one incurs a latency cost l.17
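The cost model lends itself to a direct calculation. The sketch below applies the per-superstep formula w + h·g + l and sums over supersteps; the machine parameters and workload numbers are made up for illustration.

```python
# Sketch of the BSP cost formula: each superstep costs w + h*g + l,
# and the algorithm's total cost is the sum over all supersteps.

def bsp_cost(supersteps, g, l):
    """supersteps: list of (w, h) pairs -- max local flops and the
    h-relation (max words sent/received by any processor)."""
    return sum(w + h * g + l for w, h in supersteps)

# Two supersteps on a machine with g = 4 time units/word, l = 100:
# superstep 1: w = 1000 flops, h = 50 words -> 1000 + 200 + 100 = 1300
# superstep 2: w =  500 flops, h = 10 words ->  500 +  40 + 100 =  640
print(bsp_cost([(1000, 50), (500, 10)], g=4, l=100))  # 1940
```

The example makes the designer's trade-off visible: merging the two supersteps into one would save a latency term l, which is why minimizing the number of supersteps matters.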
The LogP Model: Capturing Realistic Machine Constraints
While BSP provides a valuable bridge, some researchers sought a model with even greater fidelity to the characteristics of real distributed-memory machines. The LogP model, developed by Culler, Karp, Patterson, and others, was a direct response to the PRAM’s over-simplification, offering a more nuanced view of communication costs without being tied to a specific network topology.14
Defining Parameters
The LogP model characterizes a parallel machine using four parameters that capture the physical realities of communication in a distributed system 17:
- L (Latency): An upper bound on the delay, or latency, incurred in communicating a small message from its source processor to its target processor.
- o (Overhead): The time a processor is engaged in the transmission or reception of a message. During this time, the processor is busy with communication and cannot perform other computations.
- g (Gap): The minimum time interval between consecutive message transmissions or receptions at a processor. The reciprocal, 1/g, corresponds to the available per-processor communication bandwidth.
- P (Processors): The number of processor-memory modules.
Analysis and Usage
The LogP model provides a framework for designing algorithms that are aware of and can adapt to these physical constraints. For example, the time to send a single small message from a source to a destination can be modeled as 2o+L: one overhead cost at the sender, the network latency, and one overhead cost at the receiver.17 The gap parameter, g, dictates how quickly a processor can inject messages into the network.
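These cost expressions can be written out directly. The sketch below computes the single-message time 2o + L from the text, and adds a completion-time formula for a burst of k back-to-back messages under the simplifying assumption (mine, not the text's) that g ≥ o, so injection is gap-limited.

```python
# Sketch of simple LogP cost expressions. The single-message formula
# 2o + L follows the text; the k-message formula assumes g >= o, so
# successive sends are paced by the gap g rather than by the overhead o.

def one_message_time(L, o):
    # sender overhead + network latency + receiver overhead
    return 2 * o + L

def k_message_completion(L, o, g, k):
    # Time until the k-th message is fully received: the k-th send
    # starts (k-1)*g after the first, then costs 2o + L itself.
    return (k - 1) * g + 2 * o + L

print(one_message_time(L=10, o=2))                # 14
print(k_message_completion(L=10, o=2, g=5, k=4))  # 29
```

Because the sender is only busy for o of each message's 2o + L cost, the remaining time per send is available for local computation, which is how LogP exposes the computation/communication overlap discussed below.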
This level of detail allows designers to reason about techniques like overlapping computation with communication, a critical optimization in modern systems that is not captured by the simpler BSP model.24 The primary strength of LogP is its ability to guide the development of highly efficient, low-level communication routines.17 However, this fidelity comes at the cost of increased complexity in analysis. The model is most accurate at the machine-instruction level and can be less precise for systems with complex, multi-layered communication protocols like TCP/IP.17 To address limitations such as fixed message sizes, extensions like LogGP were introduced, which add a parameter G to model the per-byte cost for long messages.24
The progression from PRAM to BSP and finally to LogP is a narrative of the parallel computing community’s evolving understanding of performance bottlenecks. PRAM initiated the field by focusing on the limits of pure concurrency, assuming communication was free. The practical failure of this assumption on real hardware, where communication is the dominant cost, necessitated models that explicitly parameterize this overhead. BSP was the first major step, abstracting communication and synchronization into global parameters per superstep. LogP further refined this by deconstructing communication into its constituent parts—latency, processor overhead, and bandwidth limitations—reflecting the finer-grained realities of message-passing systems. This evolution demonstrates a clear shift in focus from “how many operations can we do at once?” to “how do we manage the costs of moving data?”.
These models also represent a fundamental trade-off between abstraction and fidelity. PRAM offers maximum abstraction, making it simple to reason with and an excellent tool for theoretical exploration and education.3 This simplicity, however, comes at the cost of fidelity, as PRAM-optimal algorithms can perform poorly in practice by encouraging pathologically high communication patterns.14 At the other end of the spectrum, LogP offers high fidelity, enabling the design of algorithms that are finely tuned to machine characteristics.14 The price of this fidelity is analytical complexity. BSP stands as the “bridging model” because it strikes a balance, offering more realism than PRAM but greater simplicity than LogP, making it a pragmatic choice for designing portable, general-purpose parallel algorithms.15
Feature | Parallel Random Access Machine (PRAM) | Bulk Synchronous Parallel (BSP) | LogP Model |
---|---|---|---|
Core Concept | An idealized shared-memory machine with synchronous, unit-time memory access. | A “bridging model” that structures computation into a sequence of “supersteps.” | A distributed-memory model capturing realistic constraints of message passing. |
Memory Abstraction | Single, global shared memory. | Processor-local memories. | Distributed, per-processor local memories. |
Communication Model | Implicit and free (unit-time memory access). | Explicit message passing within a superstep, costed globally. | Explicit point-to-point messages, costed with detailed parameters. |
Synchronization | Implicit, lock-step execution across all processors. | Explicit, coarse-grained barrier synchronization at the end of each superstep. | Implicit in message arrival; no built-in global synchronization. |
Key Parameters | Number of processors (P), memory access rules (EREW, CREW, CRCW). | P (processors), g (gap/bandwidth), l (latency/sync cost). | L (latency), o (overhead), g (gap), P (processors). |
Primary Strength | Simplicity; allows focus on pure concurrency and establishing theoretical bounds. | Balances abstraction and realism; good for portable algorithm design. | High fidelity to hardware; enables fine-tuning and overlapping communication/computation. |
Primary Weakness | Unrealistic; ignores dominant costs of communication and synchronization. | Coarse-grained synchronization can be inefficient; less detailed than LogP. | More complex to analyze; less accurate for high-level communication protocols. |
Table 8.1.1: Comparative Analysis of Abstract Parallel Computation Models. This table summarizes the key characteristics of the PRAM, BSP, and LogP models, highlighting their different approaches to modeling memory, communication, and synchronization, and outlining their respective strengths and weaknesses in the context of parallel algorithm design.17