8.4 Case Study: Parallel Sorting Algorithms
Sorting is a fundamental problem in computer science, serving not only as a critical component in countless applications but also as a canonical benchmark for evaluating the performance of parallel algorithms and architectures. The goal of parallel sorting is to significantly reduce the time required to sort a large dataset by distributing the work across multiple processors. While the best sequential comparison-based sorting algorithms have a time complexity of O(n log n), an ideal parallel algorithm with p processors aims for a time complexity of O((n log n)/p).57 This case study examines three distinct parallel sorting algorithms—Parallel Merge Sort, Bitonic Sort, and Sample Sort—each illustrating a different design strategy and highlighting the trade-offs between computation, communication, and architectural suitability.
Parallel Merge Sort
Merge sort’s inherent divide-and-conquer structure makes it a natural candidate for parallelization using the Fork/Join pattern.58 The algorithm can be parallelized at two key stages: the recursive subdivision and the merging process.
Methodology
The most straightforward parallelization of merge sort involves executing the two recursive calls in parallel. Given an array to sort, the algorithm splits it into two halves. One processor (or thread) recursively sorts the left half while another simultaneously sorts the right half.59 This process continues until the sub-arrays are small enough to be sorted sequentially.

The primary challenge then becomes the parallel merge step, where the two sorted sub-arrays must be combined into a single sorted array; a simple sequential merge is a bottleneck. A more advanced parallel merge algorithm has each of the p available processors take an element from one of the sorted sub-arrays (say, A) and use binary search to find its correct position in the other sub-array (B). The final position of the element in the merged output is the sum of its indices in A and B. By performing this for all elements in parallel, the merge step can be executed efficiently.60 A scalable implementation for distributed memory often has each of the p processors sort a local chunk of n/p elements, followed by a tree-based merging phase in which pairs of processors merge their sorted lists in successive rounds.61
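The fork/join recursion and the rank-based parallel merge described above can be sketched as follows. This is an illustrative Python sketch, not a production implementation: the thread pool stands in for fork/join workers, the recursion-depth cutoff is an arbitrary tuning choice, and CPython's GIL means threads here demonstrate the structure rather than real CPU parallelism.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def parallel_merge(a, b):
    """Rank-based merge of two sorted lists: each element's final position
    is its index in its own list plus its binary-search rank in the other."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # bisect_left counts the elements of b strictly less than x
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # bisect_right breaks ties so equal keys from a and b get distinct slots
        out[j + bisect.bisect_right(a, y)] = y
    return out

def parallel_merge_sort(data, executor, depth=2):
    # Base case: small arrays (or an exhausted fork budget) sort sequentially.
    if len(data) <= 1 or depth == 0:
        return sorted(data)
    mid = len(data) // 2
    # Fork: sort the two halves concurrently. The depth cutoff bounds the
    # number of blocked tasks so a fixed-size pool cannot deadlock.
    left = executor.submit(parallel_merge_sort, data[:mid], executor, depth - 1)
    right = executor.submit(parallel_merge_sort, data[mid:], executor, depth - 1)
    # Join: combine the sorted halves with the rank-based merge.
    return parallel_merge(left.result(), right.result())
```

In a true parallel setting, each iteration of the two rank loops in `parallel_merge` is independent and could be assigned to its own processor, which is exactly what the binary-search formulation enables.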
Bitonic Sort
Bitonic sort, developed by Ken Batcher, is a parallel sorting algorithm based on the concept of a sorting network.62 A key feature of this algorithm is that the sequence of comparisons is fixed and does not depend on the input data, making it a data-oblivious algorithm. This property makes it exceptionally well-suited for implementation in hardware and on highly parallel, synchronous architectures like GPUs.63
Methodology
The algorithm operates in two main phases, requiring the input size n to be a power of two:
- Building a Bitonic Sequence: The algorithm first transforms the unsorted input into a bitonic sequence. A bitonic sequence is one that first monotonically increases and then monotonically decreases, or a circular shift of such a sequence. This is done by recursively sorting smaller sub-arrays, half in ascending order and half in descending order, and then concatenating them.
- Bitonic Merging: The bitonic sequence is then sorted using a bitonic merge operation. This operation recursively compares and swaps elements at a distance of n/2, then n/4, and so on, until all elements are sorted. For a bitonic sequence a, comparing and swapping a[i] with a[i + n/2] for all i<n/2 results in two smaller bitonic sequences, where all elements in the first half are smaller than all elements in the second half. These two halves are then sorted recursively in parallel.
The parallel time complexity of bitonic sort is O(log² n) using O(n) processors.63 While it performs more total comparisons than algorithms like merge sort, its highly regular and data-independent communication pattern allows for very efficient implementation on SIMD-style architectures where processors execute instructions in lock-step.64
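The fixed compare-exchange schedule can be sketched in a few lines using the standard iterative formulation of the network; the sequential loop below simulates it, and the inner loop over `i` is the part a SIMD machine would execute in lock-step:

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two.
    The compare-exchange schedule depends only on n, never on the data."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "input size must be a power of two"
    k = 2
    while k <= n:                # k: size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # j: compare distance within this merge stage
            for i in range(n):   # on SIMD hardware, all i proceed in lock-step
                partner = i ^ j  # index paired with i at distance j
                if partner > i:
                    # The sort direction alternates between k-sized blocks,
                    # which is what builds the bitonic sequences.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Counting the loop nests shows the O(log² n) depth: there are log n values of k, and each contributes up to log n compare-exchange layers, every layer being fully parallel.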
Sample Sort
Sample sort is a powerful parallel sorting algorithm designed primarily for distributed-memory systems. It is a generalization of quicksort that uses multiple pivots, called splitters, to partition the data into many buckets, addressing the critical challenge of load balancing that plagues simple parallel quicksort implementations.65
Methodology
The algorithm proceeds in several distinct phases:65
- Sampling: Each of the p processors randomly selects a small number of samples from its local data. The number of samples is often chosen using an oversampling factor k, such that each processor selects k(p−1) samples.
- Splitter Selection: The samples from all processors are gathered to one processor (or all processors). These samples are sorted, and p−1 splitters are chosen from the sorted list at regular intervals. These splitters are intended to represent the overall distribution of the input data.
- Data Partitioning (Bucketing): The p−1 splitters are broadcast to all processors. Each processor then uses these splitters to partition its local data into p buckets. Bucket j on processor i contains all local elements that fall between splitter j−1 and splitter j.
- Data Redistribution: A global all-to-all communication phase occurs. Each processor i sends its j-th bucket to processor j. After this step, processor j holds all elements from the entire dataset that belong in the j-th bucket.
- Local Sort: Each processor sorts the data in its final bucket using a fast sequential sorting algorithm.
- Concatenation: Since the splitters divide the entire key range, the sorted data across the processors is now globally sorted. The concatenation of the sorted buckets from processor 0 to p−1 forms the final sorted array.
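The six phases above can be simulated in a single process. The sketch below is illustrative only: the p "processors" are plain lists, the all-to-all exchange is list shuffling, and the function name and default oversampling factor are assumptions rather than anything prescribed by the algorithm.

```python
import bisect
import random

def sample_sort(data, p=4, k=3):
    """Single-process simulation of p-processor sample sort.
    k is the oversampling factor; each 'processor' is just a list here."""
    # Initial distribution: processor i holds every p-th element.
    chunks = [data[i::p] for i in range(p)]

    # Phase 1 - Sampling: each processor draws k*(p-1) random samples.
    samples = []
    for chunk in chunks:
        samples.extend(random.choices(chunk, k=k * (p - 1)))

    # Phase 2 - Splitter selection: sort all samples and pick p-1 of them
    # at regular intervals to approximate the global key distribution.
    samples.sort()
    step = len(samples) // p
    splitters = [samples[(i + 1) * step] for i in range(p - 1)]

    # Phase 3 - Bucketing: each processor partitions its local data by
    # binary-searching the (broadcast) splitters.
    outgoing = [[[] for _ in range(p)] for _ in range(p)]
    for i, chunk in enumerate(chunks):
        for x in chunk:
            j = bisect.bisect_right(splitters, x)  # destination processor
            outgoing[i][j].append(x)

    # Phase 4 - All-to-all redistribution: processor j receives its bucket
    # from every processor i.
    received = [sum((outgoing[i][j] for i in range(p)), []) for j in range(p)]

    # Phases 5 & 6 - Local sort, then concatenation in processor order.
    for bucket in received:
        bucket.sort()
    return [x for bucket in received for x in bucket]
```

Because every processor routes keys with the same splitters, bucket j holds only keys between splitter j−1 and splitter j, so concatenating the locally sorted buckets yields a globally sorted result.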
The success of sample sort hinges on the quality of the splitters. By oversampling, the algorithm increases the probability that the splitters will partition the data into nearly equal-sized buckets, thus ensuring a balanced workload in the final local sort phase, which is typically the most computationally expensive part of the algorithm.65 This upfront investment in sampling and communication to achieve load balance is what makes sample sort highly scalable on distributed systems.

The choice among these algorithms reveals a fundamental trade-off between generality and architectural specialization. Parallel merge sort is a general-purpose algorithm, representing an intuitive parallelization of a classic sequential method that performs well on shared-memory systems. In contrast, bitonic sort and sample sort are highly specialized. Bitonic sort’s data-oblivious, regular structure is optimized for the synchronous, massively parallel nature of GPUs and custom hardware.63 Sample sort is tailored for distributed-memory clusters, with its entire design—sampling, partitioning, and a single all-to-all communication phase—engineered to overcome high network latency and the critical need for load balancing across independent nodes.65 This demonstrates that in parallel computing, there is no universally “best” algorithm; the optimal choice is a function of the target hardware.

Furthermore, this case study highlights a crucial shift in thinking from sequential to parallel design. While sequential sorting algorithms focus on minimizing the number of comparisons, the primary challenge in scalable parallel sorting is often not computation but rather communication and load balancing.
A simple parallel quicksort can fail catastrophically if a poor pivot choice sends most of the data to one processor, leaving others idle and creating a sequential bottleneck.67 Sample sort’s design is a direct acknowledgment of this problem; its complex sampling and splitter selection phases are an upfront cost paid specifically to ensure a balanced distribution of work, thereby minimizing processor idle time and maximizing parallel efficiency.65
| Feature | Parallel Merge Sort | Bitonic Sort | Sample Sort |
| --- | --- | --- | --- |
| Core Idea | Recursive “divide and conquer” where sub-problems are sorted in parallel and then merged. | A data-oblivious sorting network that builds and merges bitonic sequences. | A generalization of quicksort that uses multiple “splitters” to partition data into buckets for load balancing. |
| Parallelism Strategy | Task parallelism (Fork/Join). | Data parallelism (SIMD). | Data parallelism with coordinated phases. |
| Time Complexity (Ideal) | O(log n) with O(n) processors. | O(log² n) with O(n) processors. | O(log n) with O(n) processors. |
| Key Challenge | Efficiently parallelizing the merge step. | Higher number of total comparisons; requires input size to be a power of two. | Selecting good splitters to ensure load balance; high communication cost in redistribution phase. |
| Best Suited Architecture | Shared-memory multi-core systems. | SIMD architectures, GPUs, FPGAs, and hardware sorting networks. | Distributed-memory clusters and supercomputers. |
Table 8.4.1: Comparative Analysis of Parallel Sorting Algorithms. This table summarizes the core concepts, performance characteristics, and ideal architectural contexts for Parallel Merge Sort, Bitonic Sort, and Sample Sort, illustrating the diverse strategies for tackling this fundamental problem.59