8.4 Parallel Sorting Algorithms
Sorting is a fundamental problem in computer science that serves as a critical component in many applications and as a canonical benchmark for evaluating the performance of parallel algorithms and architectures. The goal of parallel sorting is to reduce the time required to sort a large dataset by distributing the work across multiple processors. While sequential comparison-based sorting algorithms have a time complexity of O(n log n), a parallel algorithm with p processors can ideally achieve a time complexity of O((n log n)/p).1 This case study examines three distinct parallel sorting algorithms—Parallel Merge Sort, Bitonic Sort, and Sample Sort—each illustrating a different design strategy and highlighting the trade-offs between computation, communication, and architectural suitability.
Parallel Merge Sort
Merge sort’s inherent divide-and-conquer structure makes it a natural candidate for parallelization using the Fork/Join pattern.2 The algorithm can be parallelized at two key stages: the recursive subdivision and the merging process.
Methodology
A direct parallelization of merge sort involves executing the two recursive calls in parallel. For a given array, the algorithm splits it into two halves. One processor sorts the left half while another simultaneously sorts the right half.3 This process continues until the sub-arrays are small enough for sequential sorting. The primary challenge is the parallel merge step, where two sorted sub-arrays must be combined into a single sorted array.
A simple sequential merge creates a bottleneck. An advanced parallel merge algorithm can have each of the p available processors take an element from one sorted sub-array (A) and use binary search to find its correct position in the other sub-array (B). The final position of the element in the merged output is the sum of its indices in A and B. By performing this for all elements in parallel, the merge step can be executed efficiently.4 A scalable implementation for distributed memory often involves each of the p processors sorting a local chunk of n/p elements, followed by a tree-based merging phase where pairs of processors merge their sorted lists in successive rounds.5
```
Algorithm: ParallelMergeSort(A[1..n], p processors)
Input:  Array A of n elements, p processors
Output: Sorted array A

function ParallelMerge(A, B, C, p)
    // Parallel merge of sorted arrays A and B into C
    n_a = length(A)
    n_b = length(B)

    parallel for i = 1 to n_a using p/2 processors do
        // Find the rank of A[i] in B using binary search
        pos_in_B = BinarySearch(B, A[i])
        // Place A[i] in its final position
        C[i + pos_in_B] = A[i]
    end parallel

    parallel for j = 1 to n_b using p/2 processors do
        // Find the rank of B[j] in A using binary search
        pos_in_A = BinarySearch(A, B[j])
        // Place B[j] in its final position
        C[j + pos_in_A] = B[j]
    end parallel

function ParallelMergeSortRecursive(A, start, end, threshold)
    if (end - start) <= threshold then
        SequentialSort(A, start, end)
        return
    end

    mid = (start + end) / 2

    // Fork: sort the two halves in parallel
    parallel
        spawn ParallelMergeSortRecursive(A, start, mid, threshold)
        spawn ParallelMergeSortRecursive(A, mid+1, end, threshold)
    end parallel

    // Join: merge the sorted halves
    ParallelMerge(A[start..mid], A[mid+1..end], A[start..end], p)

// Main algorithm
chunk_size = n / p

// Phase 1: Each processor sorts its local chunk
parallel for processor_id = 0 to p-1 do
    start = processor_id * chunk_size
    end = start + chunk_size - 1
    SequentialSort(A, start, end)
end parallel

// Phase 2: Tree-based parallel merge
for level = 1 to log(p) do
    step = 2^level
    parallel for i = 0 to p-1 step step do
        if i + step/2 < p then
            merge_start = i * chunk_size
            merge_mid = (i + step/2) * chunk_size
            merge_end = min((i + step) * chunk_size, n)
            ParallelMerge(A[merge_start..merge_mid-1],
                          A[merge_mid..merge_end-1],
                          A[merge_start..merge_end-1], p)
        end
    end parallel
end

return A
```

Time Complexity: O((n/p) log(n/p) + n log p) = O((n log n)/p + n log p)
Communication: O(log p) merge phases
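The rank-based merge described above can be sketched in plain Python (a sequential simulation; `rank_merge` is an illustrative name, and the `bisect` module stands in for the binary searches — every loop iteration is independent and could run on its own processor):

```python
import bisect

def rank_merge(a, b):
    """Merge sorted lists a and b by computing each element's final rank.

    An element's output position is its index in its own list plus the
    number of elements ranked before it in the other list, found by
    binary search. All iterations of both loops are independent.
    """
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # bisect_left counts elements of b strictly smaller than x
        c[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # bisect_right counts elements of a <= y, so equal keys from
        # a and b land in distinct output slots
        c[j + bisect.bisect_right(a, y)] = y
    return c
```

Using `bisect_left` for one list and `bisect_right` for the other is what keeps duplicate keys from colliding in the output array.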
Bitonic Sort
Bitonic sort is a parallel sorting algorithm based on the concept of a sorting network.6 A key feature of this algorithm is that the sequence of comparisons is fixed and does not depend on the input data, making it a data-oblivious algorithm. This property makes it well-suited for implementation in hardware and on highly parallel, synchronous architectures like GPUs.7
Methodology
The algorithm operates in two main phases, requiring the input size n to be a power of two:
- Building a Bitonic Sequence: The algorithm first transforms the unsorted input into a bitonic sequence—a sequence that first monotonically increases and then monotonically decreases, or a circular shift thereof. This is done by recursively sorting smaller sub-arrays, half in ascending order and half in descending order, and then concatenating them.
- Bitonic Merging: The bitonic sequence is then sorted using a bitonic merge operation. This operation recursively compares and swaps elements at a distance of n/2, then n/4, and so on, until all elements are sorted. For a bitonic sequence a, comparing and swapping a[i] with a[i + n/2] for all i < n/2 yields two smaller bitonic sequences in which every element of the first half is less than or equal to every element of the second half. These two halves are then sorted recursively in parallel.
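The halving property behind the bitonic merge can be checked directly in Python (a small illustrative helper, not part of the pseudocode below; the input [2, 5, 8, 9, 7, 4, 3, 1] rises and then falls, so it is bitonic):

```python
def bitonic_split(seq):
    """One compare-and-swap pass at distance n/2 over a bitonic sequence.

    Returns (lo, hi): two bitonic halves in which every element of lo
    is <= every element of hi, as the text claims. In a sorting network
    all k comparisons in the loop happen simultaneously.
    """
    k = len(seq) // 2
    lo, hi = list(seq[:k]), list(seq[k:])
    for i in range(k):
        if lo[i] > hi[i]:
            lo[i], hi[i] = hi[i], lo[i]
    return lo, hi
```

For example, `bitonic_split([2, 5, 8, 9, 7, 4, 3, 1])` yields two halves whose maximum and minimum respect the split, so each half can then be merged independently.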
The parallel time complexity of bitonic sort is O(log² n) using O(n) processors.7 While it performs more total comparisons than algorithms like merge sort, its highly regular and data-independent communication pattern allows for efficient implementation on SIMD-style architectures where processors execute instructions in lock-step.8
```
Algorithm: BitonicSort(A[0..n-1], p processors)
Input:  Array A of n elements (n must be a power of 2)
Output: Sorted array in ascending order

function BitonicMerge(A, low, count, direction)
    // direction: 1 for ascending, 0 for descending
    if count > 1 then
        k = count / 2

        // Compare and swap elements at distance k
        parallel for i = low to low + k - 1 do
            if (direction == 1 and A[i] > A[i+k]) or
               (direction == 0 and A[i] < A[i+k]) then
                swap(A[i], A[i+k])
            end
        end parallel

        // Recursively merge both halves
        parallel
            BitonicMerge(A, low, k, direction)
            BitonicMerge(A, low + k, k, direction)
        end parallel
    end

function BitonicSortRecursive(A, low, count, direction)
    if count > 1 then
        k = count / 2

        // Sort first half ascending, second half descending
        parallel
            BitonicSortRecursive(A, low, k, 1)
            BitonicSortRecursive(A, low + k, k, 0)
        end parallel

        // Merge the resulting bitonic sequence
        BitonicMerge(A, low, count, direction)
    end

// Main execution
BitonicSortRecursive(A, 0, n, 1)
return A
```

Example execution for n = 8:
- Level 1: Create 4 bitonic sequences of size 2
- Level 2: Merge into 2 bitonic sequences of size 4
- Level 3: Final merge into 1 sorted sequence of size 8
Total depth: O(log² n) compare-and-swap stages
Parallelism: all comparisons at the same distance can execute simultaneously
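As a concrete check, the pseudocode transcribes almost line-for-line into Python (a sequential simulation in which the parallel loops become ordinary loops; function names are mine):

```python
def bitonic_merge(a, low, count, ascending):
    """Sort the bitonic subsequence a[low:low+count] in the given direction."""
    if count <= 1:
        return
    k = count // 2
    # Compare-and-swap at distance k; these k comparisons are
    # independent and would run in lock-step on a SIMD machine.
    for i in range(low, low + k):
        if (a[i] > a[i + k]) == ascending:
            a[i], a[i + k] = a[i + k], a[i]
    bitonic_merge(a, low, k, ascending)
    bitonic_merge(a, low + k, k, ascending)

def bitonic_sort(a, low=0, count=None, ascending=True):
    """Bitonic sort of a[low:low+count]; len(a) must be a power of two."""
    if count is None:
        count = len(a)
    if count <= 1:
        return a
    k = count // 2
    bitonic_sort(a, low, k, True)        # first half ascending
    bitonic_sort(a, low + k, k, False)   # second half descending -> bitonic
    bitonic_merge(a, low, count, ascending)
    return a
```

Note that the sequence of index pairs compared never depends on the data, only on n, which is exactly the data-oblivious property discussed above.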
Sample Sort
Sample sort is a parallel sorting algorithm designed for distributed-memory systems. It is a generalization of quicksort that uses multiple pivots, called splitters, to partition the data into multiple buckets, addressing the load balancing challenge that affects simple parallel quicksort implementations.9
Methodology
The algorithm proceeds in several distinct phases:9
- Sampling: Each of the p processors randomly selects a small number of samples from its local data. The number of samples is often chosen using an oversampling factor k, such that each processor selects k(p−1) samples.
- Splitter Selection: The samples from all processors are gathered to one processor (or all processors). These samples are sorted, and p−1 splitters are chosen from the sorted list at regular intervals. These splitters are intended to represent the overall distribution of the input data.
- Data Partitioning (Bucketing): The p−1 splitters are broadcast to all processors. Each processor then uses these splitters to partition its local data into p buckets. Bucket j on processor i contains all local elements that fall between splitter j−1 and splitter j.
- Data Redistribution: A global all-to-all communication phase occurs. Each processor i sends its j-th bucket to processor j. After this step, processor j holds all elements from the entire dataset that belong in the j-th bucket.
- Local Sort: Each processor sorts the data in its final bucket using a fast sequential sorting algorithm.
- Concatenation: Because the splitters divide the entire key range, the sorted data across the processors is now globally sorted. The concatenation of the sorted buckets from processor 0 to p−1 forms the final sorted array.
The success of sample sort depends on the quality of the splitters. By oversampling, the algorithm increases the probability that the splitters will partition the data into nearly equal-sized buckets, thus ensuring a balanced workload in the final local sort phase.9 This upfront investment in sampling and communication to achieve load balance makes sample sort highly scalable on distributed systems.
```
Algorithm: SampleSort(A[1..n], p processors)
Input:  Array A distributed across p processors
Output: Globally sorted array

oversampling_factor = 10
samples_per_processor = oversampling_factor * (p - 1)

// Phase 1: Each processor samples its local data
parallel for each processor i = 0 to p-1 do
    local_samples[i] = randomly_select(samples_per_processor, A[i])
end parallel

// Phase 2: Select splitters from all samples
// (p * samples_per_processor samples in total, so one splitter
//  every samples_per_processor positions)
all_samples = gather_and_sort(local_samples)
for k = 1 to p-1 do
    splitters[k] = all_samples[k * samples_per_processor]
end
broadcast(splitters)

// Phase 3: Partition local data into buckets
parallel for each processor i = 0 to p-1 do
    buckets[i][0..p-1] = empty
    for each element in A[i] do
        bucket_id = BinarySearch(splitters, element)
        buckets[i][bucket_id].append(element)
    end
end parallel

// Phase 4: Exchange buckets (all-to-all communication)
parallel for each processor i = 0 to p-1 do
    for j = 0 to p-1 do
        send(buckets[i][j], to_processor=j)
    end
    new_local_data[i] = receive_from_all_processors()
end parallel

// Phase 5: Each processor sorts its received data
parallel for each processor i = 0 to p-1 do
    SequentialSort(new_local_data[i])
end parallel

return concatenate(new_local_data[0], ..., new_local_data[p-1])
```

Time Complexity:
- Sampling and splitter selection: O(p² log p) — gathering and sorting the O(p²) samples, for a constant oversampling factor
- Local sort: O((n/p) log(n/p))
- All-to-all: O(n/p + p) communication
- Total: O((n/p) log(n/p) + p² log p)

Load Balance: depends on the quality of the splitters (improved by oversampling)
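The six phases can be simulated in a single Python process (a sketch: `p` plays the role of the processor count, list slices stand in for distributed chunks, and the buckets stand in for each processor's data after the all-to-all exchange; the output is correct regardless of splitter quality, which only affects balance):

```python
import bisect
import random

def sample_sort(data, p, oversample=10):
    """Single-process simulation of sample sort with p virtual processors."""
    # Phase 1: split into p local chunks and draw samples from each
    chunks = [data[i::p] for i in range(p)]
    samples = []
    for chunk in chunks:
        k = min(len(chunk), oversample * (p - 1))
        samples.extend(random.sample(chunk, k))

    # Phase 2: sort the gathered samples, pick p-1 evenly spaced splitters
    samples.sort()
    step = len(samples) // p
    splitters = [samples[(j + 1) * step] for j in range(p - 1)]

    # Phases 3-4: route every element to the bucket its key falls in
    # (binary search over the splitters, as in the pseudocode)
    buckets = [[] for _ in range(p)]
    for chunk in chunks:
        for x in chunk:
            buckets[bisect.bisect_right(splitters, x)].append(x)

    # Phases 5-6: local sort per bucket, then concatenate in bucket order
    return [x for bucket in buckets for x in sorted(bucket)]
```

Because the splitters divide the key range, concatenating the sorted buckets in order yields a globally sorted result with no final merge step, which is the key structural difference from parallel merge sort.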
The choice among these algorithms reveals a fundamental trade-off between generality and architectural specialization. Parallel merge sort is a general-purpose algorithm that performs well on shared-memory systems. In contrast, bitonic sort and sample sort are more specialized. Bitonic sort’s data-oblivious, regular structure is optimized for the synchronous, massively parallel nature of GPUs and custom hardware.7 Sample sort is tailored for distributed-memory clusters, with a design engineered to overcome high network latency and the need for load balancing across independent nodes.9 This demonstrates that in parallel computing, the optimal algorithm is a function of the target hardware.
Furthermore, this case study highlights a crucial difference between sequential and parallel design. While sequential sorting algorithms focus on minimizing comparisons, the primary challenge in scalable parallel sorting is often communication and load balancing. A simple parallel quicksort can fail if a poor pivot choice sends most of the data to one processor, creating a sequential bottleneck.10 Sample sort’s design acknowledges this problem; its sampling and splitter selection phases are an upfront cost paid to ensure a balanced distribution of work, thereby minimizing processor idle time and maximizing parallel efficiency.9
| Feature | Parallel Merge Sort | Bitonic Sort | Sample Sort |
|---|---|---|---|
| Core Idea | Recursive “divide and conquer” where sub-problems are sorted in parallel and then merged. | A data-oblivious sorting network that builds and merges bitonic sequences. | A generalization of quicksort that uses multiple “splitters” to partition data into buckets for load balancing. |
| Parallelism Strategy | Task Parallelism (Fork/Join). | Data Parallelism (SIMD). | Data Parallelism with coordinated phases. |
| Time Complexity (Ideal) | O(log n) with O(n) processors. | O(log² n) with O(n) processors. | O(log n) with O(n) processors. |
| Key Challenge | Efficiently parallelizing the merge step. | Higher number of total comparisons; requires input size to be a power of two. | Selecting good splitters to ensure load balance; high communication cost in redistribution phase. |
| Best Suited Architecture | Shared-memory multi-core systems. | SIMD architectures, GPUs, FPGAs, and hardware sorting networks. | Distributed-memory clusters and supercomputers. |
Table 8.4.1: Comparative Analysis of Parallel Sorting Algorithms. This table summarizes the core concepts, performance characteristics, and ideal architectural contexts for Parallel Merge Sort, Bitonic Sort, and Sample Sort, illustrating the diverse strategies for tackling this fundamental problem.3
References

Footnotes

1. Which parallel sorting algorithm has the best average case performance? - Stack Overflow, accessed October 6, 2025, https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance
2. Parallel Merge Sort - San Jose State University, accessed October 6, 2025, https://www.sjsu.edu/people/robert.chun/courses/cs159/s3/T.pdf
3. Parallel Merge Sort | Zaid Humayun’s Blog, accessed October 6, 2025, https://redixhumayun.github.io/systems/2023/12/29/parallel-merge-sort.html
4. Overview Parallel Merge Sort, accessed October 6, 2025, https://stanford.edu/~rezab/classes/cme323/S16/notes/Lecture04/cme323_lec4.pdf
5. Parallel Merge Sort Algorithm. Introduction | by Rachit Vasudeva …, accessed October 6, 2025, https://rachitvasudeva.medium.com/parallel-merge-sort-algorithm-e8175ab60e7
6. What is Bitonic sort? - Educative.io, accessed October 6, 2025, https://www.educative.io/answers/what-is-bitonic-sort
7. Bitonic sorter - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Bitonic_sorter
8. Bitonic Sort - GeeksforGeeks, accessed October 6, 2025, https://www.geeksforgeeks.org/dsa/bitonic-sort/
9. Samplesort - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Samplesort
10. 12.7.4 Quicksort or Samplesort Algorithm - The Netlib, accessed October 6, 2025, https://www.netlib.org/utk/lsi/pcwLSI/text/node302.html