8.1 Abstract Models of Parallel Computation
To reason about the efficiency and correctness of parallel algorithms, designers rely on abstract models of computation. These models provide a simplified, formal framework that captures the essential features of a parallel machine while omitting low-level architectural details. This abstraction is crucial; it allows for the development of algorithms that are not tied to a specific piece of hardware, enabling analysis that can predict performance and reveal fundamental computational limits.3 A model of parallel computation typically consists of a programming model, which defines the basic operations and their effects, and a corresponding cost model, which associates a cost (e.g., time) with those operations.3 The evolution of these models reflects the historical progression of the high-performance computing field itself, shifting focus from pure concurrency to the increasingly critical bottlenecks of communication and synchronization.
The Parallel Random Access Machine (PRAM): An Idealized Foundation
The Parallel Random Access Machine (PRAM) was one of the earliest and most influential models for parallel computation, proposed as a natural extension of the sequential Random Access Machine (RAM) model.6 Its purpose was to provide a simple, clean theoretical framework for exploring the logical structure of parallel computation, deliberately divorced from the complexities of inter-processor communication and synchronization.8
Definition and Core Principles
A PRAM consists of an unbounded number of identical processors and a single, global shared memory to which all processors have uniform access.5 The key assumptions of the PRAM model are 7:
- Unbounded Processors: There is no limit to the number of processors that can be applied to a problem.
- Shared Global Memory: All processors communicate and coordinate through a common memory space.
- Synchronous Execution: All processors operate in lock-step, driven by a common clock. In each step, a processor can read from a memory location, perform a local computation, and write to a memory location.
- Unit-Time Memory Access: Any memory access, whether a read or a write, from any processor to any memory location, completes in a single time unit.
This idealized environment allows algorithm designers to focus exclusively on concurrency—that is, identifying which parts of a computation can be performed simultaneously.6
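The lock-step execution style can be made concrete with a small simulation. The sketch below (illustrative only; the function name and structure are invented, not from the text) mimics an EREW PRAM parallel reduction: in each synchronous step, every active "processor" combines two cells of shared memory, and no two processors touch the same cell in the same step, so all accesses are exclusive. The sum of n values completes in ceil(log2 n) PRAM steps.

```python
# Illustrative sketch (not from the text): simulating an EREW PRAM
# parallel reduction. In each synchronous step, "processor" i combines
# shared[i] and shared[i + stride]; because the index pairs are disjoint
# within a step, the access pattern is exclusive-read, exclusive-write.

def pram_sum(values):
    shared = list(values)            # the global shared memory
    n = len(shared)
    stride = 1
    steps = 0
    while stride < n:
        # One synchronous step: all active processors act "at once".
        for i in range(0, n - stride, 2 * stride):
            shared[i] += shared[i + stride]
        stride *= 2
        steps += 1
    return shared[0], steps          # result and number of PRAM steps

total, steps = pram_sum(range(1, 9))  # 1 + 2 + ... + 8
print(total, steps)                   # 36 in log2(8) = 3 steps
```

The inner `for` loop is sequential in this simulation, but under the PRAM model every iteration of it would execute simultaneously in a single unit-time step, which is exactly the concurrency the model is designed to expose.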
PRAM Variants (Handling Memory Access Conflicts)
The synchronous, unit-time access of the PRAM model introduces the possibility of memory access conflicts, where multiple processors attempt to read from or write to the same memory location in the same clock cycle. The PRAM model is therefore classified into several variants based on how it resolves these conflicts.6
- Exclusive Read, Exclusive Write (EREW): This is the most restrictive and weakest PRAM variant. It forbids any simultaneous access to a single memory location. An algorithm for an EREW PRAM must explicitly ensure that no two processors ever attempt to read from or write to the same address at the same time.6
- Concurrent Read, Exclusive Write (CREW): This model relaxes the read constraint, allowing any number of processors to read from the same memory location simultaneously. However, write access remains exclusive; only one processor can write to a given location at a time.6
- Concurrent Read, Concurrent Write (CRCW): This is the most powerful PRAM variant, allowing simultaneous reads and writes to the same memory location. Because a concurrent write introduces ambiguity about which value should be stored, the CRCW model is further divided into sub-variants that define a write-conflict resolution protocol 6:
- Common CRCW: All processors attempting to write to the same location must be writing the exact same value. If they attempt to write different values, the behavior is illegal.
- Arbitrary CRCW: If multiple processors write to the same location, one is chosen arbitrarily, and its value is written. The algorithm cannot make any assumptions about which processor will “win.”
- Priority CRCW: Each processor is assigned a unique priority (often based on its processor ID), and the processor with the highest priority succeeds in writing its value.
It can be shown that a weaker PRAM variant can simulate a more powerful one at a modest cost. For instance, any algorithm for a CRCW PRAM can be simulated on an EREW PRAM with the same number of processors, at the cost of an O(log P) slowdown, where P is the number of processors.8
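The three CRCW write-conflict protocols can be expressed as small resolver functions. The sketch below is a hypothetical illustration (the function names and the `(processor_id, value)` request format are invented): each resolver takes the set of write requests aimed at one memory cell in one step and decides which value the cell ends up holding.

```python
# Hypothetical sketch of the three CRCW write-conflict protocols.
# Each request is a (processor_id, value) pair aimed at a single cell.

def resolve_common(requests):
    # Common CRCW: all writers must agree on the value, else illegal.
    values = {v for _, v in requests}
    if len(values) > 1:
        raise ValueError("illegal: processors wrote different values")
    return values.pop()

def resolve_arbitrary(requests):
    # Arbitrary CRCW: any one writer may win; this sketch takes the
    # first, but an algorithm must not rely on that choice.
    return requests[0][1]

def resolve_priority(requests):
    # Priority CRCW: here the lowest processor ID is treated as the
    # highest priority (one common convention, assumed for this sketch).
    return min(requests, key=lambda r: r[0])[1]

reqs = [(3, 7), (1, 9), (2, 5)]
print(resolve_priority(reqs))    # 9: processor 1 outranks 2 and 3
print(resolve_common([(0, 4), (1, 4)]))  # 4: all writers agree
```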
Theoretical Significance and Limitations
The PRAM model’s primary significance is theoretical. It serves as a benchmark for determining the maximum possible parallelism inherent in a problem, ignoring all physical constraints.7 By analyzing algorithms in the PRAM model, researchers can classify problems based on their parallel complexity. For example, the class NC (Nick’s Class) includes problems that can be solved on a PRAM in polylogarithmic time (O(log^k n)) using a polynomial number of processors (n^O(1)).8 This class represents problems that are considered highly parallelizable.
However, the PRAM is an idealized model that is not physically realizable.8 Its assumption of unit-time access to a global memory for any number of processors is its greatest strength for theoretical analysis and its fatal flaw for practical application. To connect P processors to M memory locations with constant-time access would require a crossbar switch of complexity O(PM), which is prohibitively expensive and does not scale.10 PRAM algorithms, by ignoring communication costs, often encourage the development of “overly fine-grained” solutions that perform an excessive number of communication steps. In practice, these algorithms perform poorly because communication is far more expensive than computation.14
The Bulk Synchronous Parallel (BSP) Model: A Bridging Approach
The practical failures of the PRAM model led to the development of frameworks that explicitly account for the costs of communication and synchronization. The Bulk Synchronous Parallel (BSP) model, proposed by Leslie Valiant, was designed as a “bridging model” to sit between the pure abstraction of PRAM and the complex details of specific hardware architectures.15
Components and the Superstep
A BSP computer is an abstract machine consisting of three main components 15:
- A set of processor-memory pairs.
- A communication network that delivers messages between processors.
- A hardware facility for efficient barrier synchronization of all processors.
Computation in the BSP model proceeds in a sequence of supersteps. Each superstep is composed of three ordered, sequential phases 15:
- Local Computation: Every processor performs computations independently, using only the data stored in its local memory.
- Communication: Processors exchange data by sending and receiving messages. The model treats all communication actions within a superstep as a single logical unit.
- Barrier Synchronization: A global barrier ensures that all processors have finished their computation and communication phases. Once all processors have reached the barrier, all messages sent during the superstep are guaranteed to have arrived at their destinations, and the next superstep can begin.
This superstep structure enforces a disciplined, bulk-synchronous style of programming, which avoids the potential for deadlock and simplifies reasoning about the program’s state.15
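The superstep discipline described above can be sketched as a toy simulator. This is an illustrative sketch only (the `bsp_run` driver and the scatter/gather steps are invented for the example): each processor computes on its local data and posts messages, and the barrier at the end of the superstep is the only point at which messages are delivered.

```python
# Illustrative sketch of the BSP superstep discipline. Messages posted
# during a superstep become visible only after the barrier, never earlier.

def bsp_run(local_data, superstep_fns):
    p = len(local_data)
    inbox = [[] for _ in range(p)]
    for step_fn in superstep_fns:
        outbox = [[] for _ in range(p)]
        # Phases 1-2: local computation plus recorded message sends.
        for pid in range(p):
            local_data[pid], sends = step_fn(pid, local_data[pid], inbox[pid])
            for dest, msg in sends:
                outbox[dest].append(msg)
        # Phase 3: barrier -- all pending messages are delivered here.
        inbox = outbox
    return local_data

# Superstep 1: every processor sends its value to processor 0.
def scatter(pid, local, inbox):
    return local, [(0, local)]

# Superstep 2: processor 0 sums what arrived at the barrier.
def gather(pid, local, inbox):
    return (sum(inbox) if pid == 0 else local), []

print(bsp_run([1, 2, 3, 4], [scatter, gather])[0])  # 10
```

Note that `gather` cannot run in the same superstep as `scatter`: the values it sums only become visible after the barrier, which is precisely the deadlock-free, easy-to-reason-about structure the model enforces.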
Cost Analysis
The BSP model introduces a simple yet powerful cost model to analyze algorithm performance. The total cost of a BSP algorithm is the sum of the costs of its supersteps. The cost of a single superstep is determined by three parameters 15:
- w: The maximum number of floating-point operations (flops) performed by any single processor during the local computation phase.
- g: A parameter representing the time per word to deliver data under continuous traffic, effectively modeling the inverse of the communication network’s bandwidth.
- l: A parameter representing the latency of the barrier synchronization, which includes both network latency and the overhead of ensuring all processors have reached the barrier.
A superstep in which the maximum number of words sent or received by any processor is h is called an h-relation. The cost for this superstep is then calculated as:
Cost_superstep = w + h·g + l
The goal of a BSP algorithm designer is to structure the computation to balance the work (w) and communication (h) across processors and, crucially, to minimize the total number of supersteps, as each one incurs a latency cost l.17
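The cost model lends itself to a direct calculation. The sketch below applies the per-superstep formula w + h·g + l and sums over supersteps; the machine parameters and workload numbers are made up for illustration.

```python
# Sketch of the BSP cost formula: each superstep costs w + h*g + l,
# and the algorithm's total cost is the sum over all supersteps.

def bsp_cost(supersteps, g, l):
    """supersteps: list of (w, h) pairs -- max local flops and the
    h-relation (max words sent/received by any processor)."""
    return sum(w + h * g + l for w, h in supersteps)

# Two supersteps on a machine with g = 4 time units/word, l = 100:
# superstep 1: w = 1000 flops, h = 50 words -> 1000 + 200 + 100 = 1300
# superstep 2: w =  500 flops, h = 10 words ->  500 +  40 + 100 =  640
print(bsp_cost([(1000, 50), (500, 10)], g=4, l=100))  # 1940
```

The example makes the designer's trade-off visible: merging the two supersteps into one would save a latency term l, which is why minimizing the number of supersteps matters.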
The LogP Model: Capturing Realistic Machine Constraints
While BSP provides a valuable bridge, some researchers sought a model with even greater fidelity to the characteristics of real distributed-memory machines. The LogP model, developed by Culler, Karp, Patterson, and others, was a direct response to the PRAM’s over-simplification, offering a more nuanced view of communication costs without being tied to a specific network topology.14
Defining Parameters
The LogP model characterizes a parallel machine using four parameters that capture the physical realities of communication in a distributed system 17:
- L (Latency): An upper bound on the delay, or latency, incurred in communicating a small message from its source processor to its target processor.
- o (Overhead): The time a processor is engaged in the transmission or reception of a message. During this time, the processor is busy with communication and cannot perform other computations.
- g (Gap): The minimum time interval between consecutive message transmissions or receptions at a processor. The reciprocal, 1/g, corresponds to the available per-processor communication bandwidth.
- P (Processors): The number of processor-memory modules.
Analysis and Usage
The LogP model provides a framework for designing algorithms that are aware of and can adapt to these physical constraints. For example, the time to send a single small message from a source to a destination can be modeled as 2o+L: one overhead cost at the sender, the network latency, and one overhead cost at the receiver.17 The gap parameter, g, dictates how quickly a processor can inject messages into the network.
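These cost expressions can be written out directly. The sketch below computes the single-message time 2o + L from the text, and adds a completion-time formula for a burst of k back-to-back messages under the simplifying assumption (mine, not the text's) that g ≥ o, so injection is gap-limited.

```python
# Sketch of simple LogP cost expressions. The single-message formula
# 2o + L follows the text; the k-message formula assumes g >= o, so
# successive sends are paced by the gap g rather than by the overhead o.

def one_message_time(L, o):
    # sender overhead + network latency + receiver overhead
    return 2 * o + L

def k_message_completion(L, o, g, k):
    # Time until the k-th message is fully received: the k-th send
    # starts (k-1)*g after the first, then costs 2o + L itself.
    return (k - 1) * g + 2 * o + L

print(one_message_time(L=10, o=2))                # 14
print(k_message_completion(L=10, o=2, g=5, k=4))  # 29
```

Because the sender is only busy for o of each message's 2o + L cost, the remaining time per send is available for local computation, which is how LogP exposes the computation/communication overlap discussed below.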
This level of detail allows designers to reason about techniques like overlapping computation with communication, a critical optimization in modern systems that is not captured by the simpler BSP model.24 The primary strength of LogP is its ability to guide the development of highly efficient, low-level communication routines.17 However, this fidelity comes at the cost of increased complexity in analysis. The model is most accurate at the machine-instruction level and can be less precise for systems with complex, multi-layered communication protocols like TCP/IP.17 To address limitations such as fixed message sizes, extensions like LogGP were introduced, which add a parameter G to model the per-byte cost for long messages.24
The progression from PRAM to BSP and finally to LogP is a narrative of the parallel computing community’s evolving understanding of performance bottlenecks. PRAM initiated the field by focusing on the limits of pure concurrency, assuming communication was free. The practical failure of this assumption on real hardware, where communication is the dominant cost, necessitated models that explicitly parameterize this overhead. BSP was the first major step, abstracting communication and synchronization into global parameters per superstep. LogP further refined this by deconstructing communication into its constituent parts—latency, processor overhead, and bandwidth limitations—reflecting the finer-grained realities of message-passing systems. This evolution demonstrates a clear shift in focus from “how many operations can we do at once?” to “how do we manage the costs of moving data?”.
These models also represent a fundamental trade-off between abstraction and fidelity. PRAM offers maximum abstraction, making it simple to reason with and an excellent tool for theoretical exploration and education.3 This simplicity, however, comes at the cost of fidelity, as PRAM-optimal algorithms can perform poorly in practice by encouraging pathologically high communication patterns.14 At the other end of the spectrum, LogP offers high fidelity, enabling the design of algorithms that are finely tuned to machine characteristics.14 The price of this fidelity is analytical complexity. BSP stands as the “bridging model” because it strikes a balance, offering more realism than PRAM but greater simplicity than LogP, making it a pragmatic choice for designing portable, general-purpose parallel algorithms.15
Feature | Parallel Random Access Machine (PRAM) | Bulk Synchronous Parallel (BSP) | LogP Model |
---|---|---|---|
Core Concept | An idealized shared-memory machine with synchronous, unit-time memory access. | A “bridging model” that structures computation into a sequence of “supersteps.” | A distributed-memory model capturing realistic constraints of message passing. |
Memory Abstraction | Single, global shared memory. | Processor-local memories. | Distributed, per-processor local memories. |
Communication Model | Implicit and free (unit-time memory access). | Explicit message passing within a superstep, costed globally. | Explicit point-to-point messages, costed with detailed parameters. |
Synchronization | Implicit, lock-step execution across all processors. | Explicit, coarse-grained barrier synchronization at the end of each superstep. | Implicit in message arrival; no built-in global synchronization. |
Key Parameters | Number of processors (P), memory access rules (EREW, CREW, CRCW). | P (processors), g (gap/bandwidth), l (latency/sync cost). | L (latency), o (overhead), g (gap), P (processors). |
Primary Strength | Simplicity; allows focus on pure concurrency and establishing theoretical bounds. | Balances abstraction and realism; good for portable algorithm design. | High fidelity to hardware; enables fine-tuning and overlapping communication/computation. |
Primary Weakness | Unrealistic; ignores dominant costs of communication and synchronization. | Coarse-grained synchronization can be inefficient; less detailed than LogP. | More complex to analyze; less accurate for high-level communication protocols. |
Table 8.1.1: Comparative Analysis of Abstract Parallel Computation Models. This table summarizes the key characteristics of the PRAM, BSP, and LogP models, highlighting their different approaches to modeling memory, communication, and synchronization, and outlining their respective strengths and weaknesses in the context of parallel algorithm design.17