
8.3 Parallel Algorithm Patterns

While data and task parallelism provide high-level strategies, parallel algorithm patterns offer concrete, reusable templates for structuring parallel programs. These patterns encapsulate common computational structures and communication schemes, providing a higher level of abstraction than raw threads or message-passing calls.1 Three of the most influential patterns are MapReduce, Fork/Join, and Pipeline.

MapReduce is a data-parallel programming model for processing large datasets on distributed clusters.2 The pattern abstracts the details of distributed computing—such as parallelization, data distribution, fault tolerance, and load balancing—into a simple interface based on two user-defined functions: Map and Reduce.3

The MapReduce workflow proceeds through a sequence of defined stages:2

  1. Input Splitting: The input data is partitioned into smaller, independent chunks called “splits.”
  2. Map Phase: The framework spawns many “Map” tasks in parallel. Each Map task reads a split and applies the user-defined Map function to it, emitting a set of intermediate key-value pairs. For a “word count” example, the Map function would read a line of text and emit a (word, 1) pair for each word.4
  3. Shuffle and Sort Phase: The framework collects all intermediate key-value pairs from all Map tasks, sorts them by key, and groups all values associated with the same key. Each key's grouped values are then delivered to a single Reduce task. For word count, all (word, 1) pairs for a specific “word” are sent to the same reducer as (word, [1, 1, 1,...]).2
  4. Reduce Phase: The framework spawns a set of “Reduce” tasks. Each Reduce task receives the grouped key-value lists and applies the user-defined Reduce function to aggregate, summarize, or transform the list of values. For word count, the Reduce function sums the list of ones to get the total count for that word.5
  5. Final Result: The output from all Reduce tasks is collected and written to the distributed file system.

The strength of MapReduce lies in its scalability and fault tolerance. It allows programmers to focus on the Map and Reduce logic, while the framework handles the distribution of work and recovery from node failures.3

The MapReduce Programming Model
Credit: Asadianfam, Shiva & Shamsi, Mahboubeh & Rasouli Kenari, Abdolreza. (2021). TVD-MRDL: traffic violation detection system using MapReduce-based deep learning for large-scale data. Multimedia Tools and Applications. 80. 1-28. 10.1007/s11042-020-09714-8.6

Algorithm: MapReduce_WordCount(documents)
Input: Collection of text documents
Output: Word frequency counts

// User-defined Map function
function Map(document_id, document_text)
    for each word in document_text do
        emit(word, 1)  // Emit intermediate (key, value) pair
    end
end

// User-defined Reduce function
function Reduce(word, list_of_counts)
    total = 0
    for each count in list_of_counts do
        total = total + count
    end
    emit(word, total)
end

// The MapReduce framework automatically handles:
// 1. Splitting input data across map workers
parallel for each document split do
    Map(split_id, split_content)
end parallel

// 2. Shuffling and sorting by key
intermediate_data = shuffle_and_sort(map_outputs)

// 3. Distributing to reduce workers
parallel for each unique_word in intermediate_data do
    Reduce(unique_word, intermediate_data[unique_word])
end parallel

return final_results

Example execution:

  • Input: [“hello world”, “hello hadoop”]
  • After Map: [(“hello”, 1), (“world”, 1), (“hello”, 1), (“hadoop”, 1)]
  • After Shuffle: [(“hadoop”, [1]), (“hello”, [1, 1]), (“world”, [1])]
  • After Reduce: [(“hadoop”, 1), (“hello”, 2), (“world”, 1)]
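The same three phases can be sketched on a single machine with Java streams. The following is a minimal, illustrative sketch only (the class name WordCountSketch is ours, and the in-process groupingBy collector stands in for the distributed shuffle/sort and Reduce phases; a real framework such as Hadoop distributes splits across workers and adds fault tolerance):

import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    public static void main(String[] args) {
        List<String> documents = List.of("hello world", "hello hadoop");

        // Map phase: each document emits one word per (word, 1) pair, in parallel.
        // Shuffle/sort: groupingBy gathers all occurrences of the same word.
        // Reduce phase: summingInt aggregates each word's list of ones.
        Map<String, Integer> counts = documents.parallelStream()
                .flatMap(doc -> Arrays.stream(doc.split("\\s+")))
                .collect(Collectors.groupingBy(
                        word -> word,                     // key: the word itself
                        Collectors.summingInt(w -> 1)));  // reduce: sum of ones

        // Prints the counts from the worked example above,
        // e.g. {hadoop=1, hello=2, world=1} (map iteration order is unspecified)
        System.out.println(counts);
    }
}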

The Fork/Join pattern is a task-parallel model for implementing divide-and-conquer algorithms on shared-memory multi-core systems.7 It is designed to exploit parallelism in problems that can be recursively broken down into smaller, independent sub-problems.

The pattern’s execution flow mirrors the recursive nature of divide-and-conquer algorithms:8

  1. Fork: A large task checks if it is small enough to be solved sequentially. If it is too large, it is “forked” by splitting it into two or more smaller, independent sub-tasks, which are then scheduled for parallel execution.
  2. Recursive Decomposition: The sub-tasks recursively fork into smaller tasks until they are below a defined threshold.
  3. Join: Once a task has forked its sub-tasks, it waits for them to complete. The join operation blocks the parent task until the results of its children are available. The parent then combines the results from its children to produce its own result.

A key implementation detail that makes the Fork/Join pattern efficient is its use of a work-stealing scheduler.9 A pool of worker threads is created (typically one per CPU core), and each thread maintains its own double-ended queue (deque) of tasks. When a thread’s deque is empty, it “steals” a task from the tail of another thread’s deque. This provides dynamic load balancing, ensuring that all cores remain busy.10 This pattern is well-suited for problems like parallel merge sort and quicksort.11
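The two ends of the deque can be seen in a toy, single-threaded Java sketch (a real scheduler uses a concurrent deque with one owner thread per deque; the task names here are placeholders):

import java.util.ArrayDeque;

// Illustration of the two deque ends used in work stealing.
public class DequeEndsDemo {
    public static void main(String[] args) {
        ArrayDeque<String> deque = new ArrayDeque<>();
        deque.push("task-1");             // owner pushes newest task on top
        deque.push("task-2");
        deque.push("task-3");
        String own    = deque.pop();      // owner pops newest first (LIFO): task-3
        String stolen = deque.pollLast(); // thief steals oldest from the other end (FIFO): task-1
        System.out.println(own + " / " + stolen);
    }
}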

Fork-Join Model

Credit: https://en.wikipedia.org/wiki/Fork-join_model

Algorithm: ForkJoin_ParallelSum(array, start, end, threshold)
Input: Array of numbers, start and end indices, threshold for sequential execution
Output: Sum of array elements

if (end - start) <= threshold then
    // Base case: compute sequentially
    sum = 0
    for i = start to end do
        sum = sum + array[i]
    end
    return sum
else
    // Recursive case: divide and conquer
    mid = (start + end) / 2
    // Fork: create two subtasks
    left_task = spawn ForkJoin_ParallelSum(array, start, mid, threshold)
    right_task = spawn ForkJoin_ParallelSum(array, mid+1, end, threshold)
    // Join: wait for results and combine
    left_sum = await left_task
    right_sum = await right_task
    return left_sum + right_sum
end

// Work-stealing scheduler (implicit):
// - Each thread has a local deque of tasks
// - Threads push/pop from their own deque (LIFO)
// - Idle threads steal from other deques (FIFO, from the opposite end)

Time Complexity: O(n/p + log n), where p is the number of processors.
Space Complexity: O(log n) for the recursion depth.
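Java's fork/join framework (java.util.concurrent) implements this pattern directly. The sketch below restates the parallel-sum pseudocode as a RecursiveTask; the class name and the THRESHOLD of 1,000 elements are illustrative choices, and real code tunes the threshold empirically:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Parallel array sum as a RecursiveTask, mirroring the pseudocode above.
class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // arbitrary tuning knob
    private final long[] array;
    private final int start, end; // half-open range [start, end)

    ParallelSum(long[] array, int start, int end) {
        this.array = array;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long sum = 0; // base case: sum sequentially
            for (int i = start; i < end; i++) sum += array[i];
            return sum;
        }
        int mid = (start + end) >>> 1;
        ParallelSum left = new ParallelSum(array, start, mid);
        ParallelSum right = new ParallelSum(array, mid, end);
        left.fork();                     // fork: schedule left half asynchronously
        long rightSum = right.compute(); // idiomatic: compute right half in this thread
        return left.join() + rightSum;   // join: wait for left half, then combine
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        long total = ForkJoinPool.commonPool()
                .invoke(new ParallelSum(data, 0, data.length));
        System.out.println(total); // 1000000
    }
}

Note the idiomatic detail: instead of forking both subtasks, the parent forks one and computes the other itself, which avoids leaving the current worker thread idle while it waits.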

The Pipeline pattern is a form of task parallelism modeled after an assembly line. It is used when a computation can be broken down into a sequence of distinct stages, where the output of one stage becomes the input of the next.12

A pipeline consists of three main components:13

  1. Source: The entry point that produces or reads a stream of data items.
  2. Stages: A series of processing units, where each stage performs a specific operation on a data item.
  3. Sink: The exit point that consumes the final processed data items.

Parallelism is achieved through temporal concurrency. While Stage 1 is processing data item i+1, Stage 2 can simultaneously process data item i, and Stage 3 can process item i-1. After an initial “fill” period, all stages of the pipeline can be active at once, each working on a different data item.14 This structure is ideal for increasing the throughput (the rate at which data items are processed), though it does not necessarily reduce the latency (the time for a single data item to pass through the pipeline).
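For instance, in a hypothetical four-stage pipeline whose stages take 5, 10, 5, and 5 ms per item, the steady-state throughput is one item per 10 ms (set by the slowest stage), while the latency of any single item is still 5 + 10 + 5 + 5 = 25 ms.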

Pipeline Parallelism
Credit: https://patterns.eecs.berkeley.edu/?page_id=542

Example: Pipeline Pattern for Image Processing

Algorithm: Pipeline_ImageProcessing(image_stream)
Input: Stream of images
Output: Processed images

// Define pipeline stages connected by shared queues
shared queue queue1, queue2, queue3

Stage 1: ImageReader()
    while has_more_images do
        image = read_next_image(image_stream)
        enqueue(queue1, image)
    end

Stage 2: NoiseReduction()
    while true do
        image = dequeue(queue1)
        filtered_image = apply_noise_filter(image)
        enqueue(queue2, filtered_image)
    end

Stage 3: EdgeDetection()
    while true do
        image = dequeue(queue2)
        edges = detect_edges(image)
        enqueue(queue3, edges)
    end

Stage 4: FeatureExtraction()
    while true do
        edges = dequeue(queue3)
        features = extract_features(edges)
        output(features)
    end

// Execute all stages concurrently
parallel
    spawn_thread(ImageReader)
    spawn_thread(NoiseReduction)
    spawn_thread(EdgeDetection)
    spawn_thread(FeatureExtraction)
end parallel

// Throughput = 1 / max(stage_time_1, stage_time_2, stage_time_3, stage_time_4)
// Latency = stage_time_1 + stage_time_2 + stage_time_3 + stage_time_4
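The queue-connected structure above maps naturally onto blocking queues. The following Java sketch reduces the idea to a three-thread pipeline (source, one stage, sink); the string transformation is a placeholder for real image operations, and the POISON sentinel is one common shutdown technique among several:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal pipeline: Source -> Stage -> Sink, connected by bounded queues.
public class PipelineSketch {
    private static final String POISON = "__DONE__"; // sentinel to shut stages down

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q1 = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> q2 = new ArrayBlockingQueue<>(16);

        Thread source = new Thread(() -> {            // Source: produce items
            try {
                for (int i = 0; i < 5; i++) q1.put("item-" + i);
                q1.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread stage = new Thread(() -> {             // Stage: transform items
            try {
                for (String item; !(item = q1.take()).equals(POISON); )
                    q2.put(item.toUpperCase());       // placeholder for real work
                q2.put(POISON);                       // forward the shutdown signal
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread sink = new Thread(() -> {              // Sink: consume results
            try {
                for (String item; !(item = q2.take()).equals(POISON); )
                    System.out.println(item);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start(); stage.start(); sink.start();  // all stages run concurrently
        source.join(); stage.join(); sink.join();
    }
}

The bounded queues double as back-pressure: if the stage falls behind, the source blocks on put() instead of flooding memory, which is why buffer sizing appears as a key challenge for this pattern.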

The performance of a pipeline is limited by its slowest stage, which determines the overall throughput.14 The pattern is used in CPU instruction execution, signal and image processing, data streaming analytics, and compilers.15

These patterns formalize different dependency structures. MapReduce is ideal for embarrassingly parallel tasks followed by a single aggregation. The Pipeline pattern is suited for problems with a linear dependency graph. The Fork/Join pattern is designed for problems with a recursive, hierarchical dependency graph. These patterns are also connected to the hardware architectures they were designed for. MapReduce is for large-scale, fault-tolerant distributed computing.16 Fork/Join is for shared-memory multi-core computing.11 The Pipeline pattern is more general but is fundamentally about optimizing throughput.

| Feature | MapReduce | Fork/Join | Pipeline |
| --- | --- | --- | --- |
| Core Concept | Data-parallel model for batch processing large datasets using Map and Reduce functions.2 | Task-parallel model for executing recursive, divide-and-conquer algorithms.7 | Task-parallel model where data items flow through a sequence of processing stages.14 |
| Primary Goal | Scalable processing of massive datasets with high fault tolerance. | Efficiently parallelize recursive algorithms on shared-memory multi-core systems. | Maximize throughput by overlapping execution of sequential stages. |
| Parallelism Type | Primarily Data Parallelism. | Task Parallelism (Divide and Conquer). | Task Parallelism (Assembly Line). |
| Execution Flow | Map -> Shuffle & Sort -> Reduce. Highly structured and synchronous between stages. | Fork (split task) -> Recursive execution -> Join (combine results). Asynchronous task execution. | Data items processed sequentially through Stage 1 -> Stage 2 -> … -> Stage N. |
| Communication | Handled implicitly by the framework during the shuffle/sort phase. | Handled implicitly by the framework through shared memory for task parameters and results. | Explicitly managed via queues or buffers between pipeline stages. |
| Key Challenge | Handling data skew, designing efficient Map/Reduce functions, I/O bottleneck. | Determining the optimal sequential threshold, balancing work in non-uniform tasks. | Balancing the workload of each stage (slowest stage determines throughput), buffer management. |
| Hardware Target | Distributed-memory systems, commodity clusters (e.g., Hadoop).16 | Shared-memory multi-core systems.11 | Shared-memory systems, or distributed systems for coarse-grained pipelines. |
| Ideal Use Cases | Log processing, web indexing, data mining, scientific data analysis. | Parallel sorting (Quicksort, Mergesort), financial modeling, search algorithms. | Image/video processing, CPU instruction pipelines, stream analytics, compilers.15 |

Table 8.3.1: Comparative Analysis of Parallel Algorithm Patterns. This table provides a side-by-side comparison of the three primary parallel algorithm patterns. It highlights their core concepts, typical use cases, and key architectural trade-offs, helping designers choose the right pattern for a given problem.

  1. Parallel Algorithm Design Strategies | Parallel and Distributed Computing Class Notes | Fiveable, accessed October 6, 2025, https://fiveable.me/parallel-and-distributed-computing/unit-6/parallel-algorithm-design-strategies/study-guide/B9NPGnrWtPEbON5O

  2. What is MapReduce? – IBM, accessed October 6, 2025, https://www.ibm.com/think/topics/mapreduce

  3. An Introduction to MapReduce with Map Reduce Example – Analytics Vidhya, accessed October 6, 2025, https://www.analyticsvidhya.com/blog/2022/05/an-introduction-to-mapreduce-with-a-word-count-example/

  4. 4.1 MapReduce — Parallel Computing for Beginners, accessed October 6, 2025, https://www.learnpdc.org/PDCBeginners/6-furtherAvenues/mapreduce.html

  5. Map Reduce and its Phases with numerical example – GeeksforGeeks, accessed October 6, 2025, https://www.geeksforgeeks.org/data-science/mapreduce-understanding-with-real-life-example/

  6. Asadianfam, Shiva & Shamsi, Mahboubeh & Rasouli Kenari, Abdolreza. (2021). TVD-MRDL: traffic violation detection system using MapReduce-based deep learning for large-scale data. Multimedia Tools and Applications. 80. 1-28. 10.1007/s11042-020-09714-8.

  7. Parallel programming: Using the Fork-Join model in Salesforce – West Monroe, accessed October 6, 2025, https://www.westmonroe.com/insights/parallel-programming-using-the-fork-Join-model-in-salesforce

  8. Fork–join model – Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Fork%E2%80%93join_model

  9. Introduction to the Fork/Join Framework – Pluralsight, accessed October 6, 2025, https://www.pluralsight.com/resources/blog/guides/introduction-to-the-fork-join-framework

  10. Fork/Join – Essential Java Classes – Oracle Help Center, accessed October 6, 2025, https://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html

  11. CS 365: Lecture 13: Fork/Join Parallelism, accessed October 6, 2025, https://ycpcs.github.io/cs365-spring2017/lectures/lecture13.html

  12. Types of parallelism – Arm Immortalis and Mali GPU OpenCL Developer Guide, accessed October 6, 2025, https://developer.arm.com/documentation/101574/latest/Parallel-processing-concepts/Types-of-parallelism

  13. The Pipeline Design Pattern – Examples in C# | HackerNoon, accessed October 6, 2025, https://hackernoon.com/the-pipeline-design-pattern-examples-in-c

  14. Pipeline | Our Pattern Language, accessed October 6, 2025, https://patterns.eecs.berkeley.edu/?page_id=542

  15. 9.3. Parallel Design Patterns — Computer Systems Fundamentals, accessed October 6, 2025, https://w3.cs.jmu.edu/kirkpams/OpenCSF/Books/csf/html/ParallelDesign.html

  16. MapReduce 101: What It Is & How to Get Started | Talend, accessed October 6, 2025, https://www.talend.com/resources/what-is-mapreduce/