8.4 Parallel Sorting Algorithms

Sorting is a fundamental problem in computer science that serves as a critical component in many applications and as a canonical benchmark for evaluating the performance of parallel algorithms and architectures. The goal of parallel sorting is to reduce the time required to sort a large dataset by distributing the work across multiple processors. While sequential comparison-based sorting algorithms have a time complexity of O(n log n), a parallel algorithm with p processors can ideally achieve a time complexity of O((n log n)/p).1 This case study examines three distinct parallel sorting algorithms—Parallel Merge Sort, Bitonic Sort, and Sample Sort—each illustrating a different design strategy and highlighting the trade-offs between computation, communication, and architectural suitability.

8.4.1 Parallel Merge Sort

Merge sort’s inherent divide-and-conquer structure makes it a natural candidate for parallelization using the Fork/Join pattern.2 The algorithm can be parallelized at two key stages: the recursive subdivision and the merging process.

A direct parallelization of merge sort involves executing the two recursive calls in parallel. For a given array, the algorithm splits it into two halves. One processor sorts the left half while another simultaneously sorts the right half.3 This process continues until the sub-arrays are small enough for sequential sorting. The primary challenge is the parallel merge step, where two sorted sub-arrays must be combined into a single sorted array.

A simple sequential merge creates a bottleneck. In a more scalable parallel merge, each of the p available processors takes elements from one sorted sub-array (A) and uses binary search to find where each would fall in the other sub-array (B). An element’s final position in the merged output is the sum of its index in A and the number of elements of B that precede it. Because every output position can be computed independently, the merge step can be executed efficiently in parallel.4 A scalable implementation for distributed memory often has each of the p processors sort a local chunk of n/p elements, followed by a tree-based merging phase in which pairs of processors merge their sorted lists over successive rounds.5
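The rank computation described above can be sketched in a few lines of Python (a sequential simulation for clarity; `rank_merge` is a hypothetical name, and the two loops are exactly the parts a parallel implementation would run concurrently). Using `bisect_left` for one array and `bisect_right` for the other breaks ties so that equal keys never collide at the same output index:

```python
from bisect import bisect_left, bisect_right

def rank_merge(a, b):
    """Merge two sorted lists by computing each element's final rank.

    An element's output position is its index in its own list plus the
    number of elements that precede it in the other list (found by
    binary search). Every position is independent of the others, which
    is what a parallel merge exploits; here the loops run sequentially.
    """
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # bisect_left counts elements of b strictly less than x
        c[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # bisect_right counts elements of a less than or equal to y,
        # so equal keys from a and b land in distinct slots
        c[j + bisect_right(a, y)] = y
    return c

print(rank_merge([1, 3, 5, 7], [2, 3, 6, 8]))
# → [1, 2, 3, 3, 5, 6, 7, 8]
```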

Algorithm: ParallelMergeSort(A[1..n], p processors)
Input: Array A of n elements, p processors
Output: Sorted array A

function ParallelMerge(A, B, C, p)
    // Parallel merge of sorted arrays A and B into array C
    n_a = length(A)
    n_b = length(B)
    parallel for i = 1 to n_a using p/2 processors do
        // Rank of A[i] in B: number of elements of B less than A[i]
        pos_in_B = BinarySearchLeft(B, A[i])
        C[i + pos_in_B] = A[i]
    end parallel
    parallel for j = 1 to n_b using p/2 processors do
        // Rank of B[j] in A, counting elements <= B[j]; this
        // tie-break keeps equal keys from colliding in C
        pos_in_A = BinarySearchRight(A, B[j])
        C[j + pos_in_A] = B[j]
    end parallel

function ParallelMergeSortRecursive(A, start, end, threshold)
    if (end - start) <= threshold then
        SequentialSort(A, start, end)
        return
    end
    mid = floor((start + end) / 2)
    // Fork: sort two halves in parallel
    parallel
        spawn ParallelMergeSortRecursive(A, start, mid, threshold)
        spawn ParallelMergeSortRecursive(A, mid+1, end, threshold)
    end parallel
    // Join: merge sorted halves (into a temporary buffer, then copy back)
    ParallelMerge(A[start..mid], A[mid+1..end], A[start..end], p)

// Main algorithm
chunk_size = n / p
// Phase 1: Each processor sorts its local chunk
parallel for processor_id = 0 to p-1 do
    start = processor_id * chunk_size
    end = start + chunk_size - 1
    SequentialSort(A, start, end)
end parallel
// Phase 2: Tree-based parallel merge
for level = 1 to log(p) do
    step = 2^level
    parallel for i = 0 to p-1 step step do
        if i + step/2 < p then
            merge_start = i * chunk_size
            merge_mid = (i + step/2) * chunk_size
            merge_end = min((i + step) * chunk_size, n)
            ParallelMerge(A[merge_start..merge_mid-1],
                          A[merge_mid..merge_end-1],
                          A[merge_start..merge_end-1], p)
        end
    end parallel
end
return A

Time Complexity: O((n/p) log(n/p) + n log p) = O((n log n)/p + n log p)
Communication: O(log p) merge phases
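The Fork/Join structure of the recursive phase can be sketched with Python’s standard `concurrent.futures` module (a minimal illustration, not a tuned implementation: `fork_join_sort`, `THRESHOLD`, and `max_depth` are names chosen here, and CPython’s GIL means threads demonstrate the pattern rather than true CPU parallelism — a production version would use processes or a language with real shared-memory threads):

```python
import concurrent.futures
import random

THRESHOLD = 8  # sub-arrays at or below this size are sorted sequentially

def merge(left, right):
    """Sequential two-way merge (the join step)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def fork_join_sort(data, pool, depth=0, max_depth=2):
    """Fork the left half to another worker, sort the right half in the
    current worker, then join the two sorted halves with a merge."""
    if len(data) <= THRESHOLD or depth >= max_depth:
        return sorted(data)
    mid = len(data) // 2
    left = pool.submit(fork_join_sort, data[:mid], pool, depth + 1, max_depth)
    right = fork_join_sort(data[mid:], pool, depth + 1, max_depth)
    return merge(left.result(), right)  # join: wait for the forked half

random.seed(0)
values = [random.randint(0, 999) for _ in range(100)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    result = fork_join_sort(values, pool)
print(result == sorted(values))
# → True
```

Capping recursion with `max_depth` keeps the number of in-flight tasks below the pool size, which avoids the deadlock that unbounded recursive submission into a fixed thread pool can cause.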

8.4.2 Bitonic Sort

Bitonic sort is a parallel sorting algorithm based on the concept of a sorting network.6 A key feature of this algorithm is that the sequence of comparisons is fixed and does not depend on the input data, making it a data-oblivious algorithm. This property makes it well-suited for implementation in hardware and on highly parallel, synchronous architectures like GPUs.7

The algorithm operates in two main phases, requiring the input size n to be a power of two:

  1. Building a Bitonic Sequence: The algorithm first transforms the unsorted input into a bitonic sequence—a sequence that first monotonically increases and then monotonically decreases, or a circular shift thereof. This is done by recursively sorting smaller sub-arrays, half in ascending order and half in descending order, and then concatenating them.
  2. Bitonic Merging: The bitonic sequence is then sorted using a bitonic merge operation. This operation recursively compares and swaps elements at a distance of n/2, then n/4, and so on, until all elements are sorted. For a bitonic sequence a, comparing and swapping a[i] with a[i + n/2] for all i < n/2 results in two smaller bitonic sequences, where all elements in the first half are smaller than all elements in the second half. These two halves are then sorted recursively in parallel.
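The compare-and-swap property in step 2 can be checked directly in Python (a small demonstration; `compare_split` is a name chosen here for one bitonic merge step):

```python
def compare_split(seq):
    """One bitonic merge step: compare a[i] with a[i + n/2] and keep the
    smaller value in the first half. Applied to a bitonic sequence, this
    yields two bitonic halves where every element of the first half is
    <= every element of the second half."""
    a = list(seq)
    half = len(a) // 2
    for i in range(half):
        if a[i] > a[i + half]:
            a[i], a[i + half] = a[i + half], a[i]
    return a[:half], a[half:]

# 1 3 5 7 | 8 6 4 2 is bitonic: it rises, then falls
lo, hi = compare_split([1, 3, 5, 7, 8, 6, 4, 2])
print(lo, hi)
# → [1, 3, 4, 2] [8, 6, 5, 7]
```

Both halves are again bitonic (the second is a circular shift of one), and max(lo) ≤ min(hi), so each half can be merged recursively and independently.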

The parallel time complexity of bitonic sort is O(log² n) using O(n) processors.7 While it performs more total comparisons than algorithms like merge sort, its highly regular and data-independent communication pattern allows for efficient implementation on SIMD-style architectures where processors execute instructions in lock-step.8

Algorithm: BitonicSort(A[0..n-1], p processors)
Input: Array A of n elements (n must be a power of 2)
Output: Sorted array in ascending order

function BitonicMerge(A, low, count, direction)
    // direction: 1 for ascending, 0 for descending
    if count > 1 then
        k = count / 2
        // Compare and swap elements at distance k
        parallel for i = low to low + k - 1 do
            if (direction == 1 and A[i] > A[i+k]) or
               (direction == 0 and A[i] < A[i+k]) then
                swap(A[i], A[i+k])
            end
        end parallel
        // Recursively merge both halves
        parallel
            BitonicMerge(A, low, k, direction)
            BitonicMerge(A, low + k, k, direction)
        end parallel
    end

function BitonicSortRecursive(A, low, count, direction)
    if count > 1 then
        k = count / 2
        // Sort first half ascending and second half descending,
        // producing a bitonic sequence of length count
        parallel
            BitonicSortRecursive(A, low, k, 1)
            BitonicSortRecursive(A, low + k, k, 0)
        end parallel
        // Merge the resulting bitonic sequence
        BitonicMerge(A, low, count, direction)
    end

// Main execution
BitonicSortRecursive(A, 0, n, 1)
return A

Example execution for n=8:

  • Level 1: Create 4 bitonic sequences of size 2
  • Level 2: Merge into 2 bitonic sequences of size 4
  • Level 3: Final merge into 1 sorted sequence of size 8

Total depth: O(log² n) compare-exchange stages
Parallelism: all comparisons at the same distance are independent and can execute simultaneously
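The recursive structure can be written compactly in Python (a sequential simulation; on a GPU, all compare-swaps at a given distance would run in lock-step rather than in a loop):

```python
def bitonic_merge(a, ascending):
    """Sort a bitonic sequence: compare-split at distance n/2,
    then recursively merge each (still bitonic) half."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    a = list(a)
    for i in range(half):
        # swap whenever the pair is out of order for this direction
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

def bitonic_sort(a, ascending=True):
    """Recursive bitonic sort; len(a) must be a power of two.
    Sorts one half up and the other down to build a bitonic
    sequence, then bitonic-merges it."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([5, 7, 1, 8, 2, 6, 4, 3]))
# → [1, 2, 3, 4, 5, 6, 7, 8]
```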

8.4.3 Sample Sort

Sample sort is a parallel sorting algorithm designed for distributed-memory systems. It is a generalization of quicksort that uses multiple pivots, called splitters, to partition the data into multiple buckets, addressing the load balancing challenge that affects simple parallel quicksort implementations.9

The algorithm proceeds in several distinct phases:9

  1. Sampling: Each of the p processors randomly selects a small number of samples from its local data. The number of samples is often chosen using an oversampling factor k, such that each processor selects k(p−1) samples.
  2. Splitter Selection: The samples from all processors are gathered to one processor (or all processors). These samples are sorted, and p−1 splitters are chosen from the sorted list at regular intervals. These splitters are intended to represent the overall distribution of the input data.
  3. Data Partitioning (Bucketing): The p−1 splitters are broadcast to all processors. Each processor then uses these splitters to partition its local data into p buckets. Bucket j on processor i contains all local elements that fall between splitter j−1 and splitter j.
  4. Data Redistribution: A global all-to-all communication phase occurs. Each processor i sends its j-th bucket to processor j. After this step, processor j holds all elements from the entire dataset that belong in the j-th bucket.
  5. Local Sort: Each processor sorts the data in its final bucket using a fast sequential sorting algorithm.
  6. Concatenation: Because the splitters divide the entire key range, the sorted data across the processors is now globally sorted. The concatenation of the sorted buckets from processor 0 to p−1 forms the final sorted array.

The success of sample sort depends on the quality of the splitters. By oversampling, the algorithm increases the probability that the splitters will partition the data into nearly equal-sized buckets, thus ensuring a balanced workload in the final local sort phase.9 This upfront investment in sampling and communication to achieve load balance makes sample sort highly scalable on distributed systems.

Algorithm: SampleSort(A[1..n], p processors)
Input: Array A distributed across p processors
Output: Globally sorted array

oversampling_factor = 10
samples_per_processor = oversampling_factor * (p - 1)
// Phase 1: Each processor samples its local data
parallel for each processor i = 0 to p-1 do
    local_samples[i] = randomly_select(samples_per_processor, A[i])
end parallel
// Phase 2: Select p-1 evenly spaced splitters from the sorted samples
// (p * samples_per_processor samples in total, so an interval of
// samples_per_processor divides them into p equal groups)
all_samples = gather_and_sort(local_samples)
for k = 1 to p-1 do
    splitters[k] = all_samples[k * samples_per_processor]
end
broadcast(splitters)
// Phase 3: Partition local data into buckets
parallel for each processor i = 0 to p-1 do
    buckets[i][0..p-1] = empty
    for each element in A[i] do
        bucket_id = BinarySearch(splitters, element)
        buckets[i][bucket_id].append(element)
    end
end parallel
// Phase 4: Exchange buckets (all-to-all communication)
parallel for each processor i = 0 to p-1 do
    for j = 0 to p-1 do
        send(buckets[i][j], to_processor=j)
    end
    new_local_data[i] = receive_from_all_processors()
end parallel
// Phase 5: Each processor sorts its received data
parallel for each processor i = 0 to p-1 do
    SequentialSort(new_local_data[i])
end parallel
// Phase 6: Concatenating the sorted buckets yields the sorted array
return concatenate(new_local_data[0], ..., new_local_data[p-1])

Time Complexity:

  • Sampling: O(samples per processor)
  • Local sort: O((n/p) log(n/p))
  • All-to-all: O(n/p) communication
  • Total: O((n/p) log(n/p) + n/p)

Load Balance: Depends on quality of splitters (improved by oversampling)
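The six phases can be simulated on a single machine in Python (a sketch under stated assumptions: `sample_sort` and its parameter names are chosen here, the "processors" are just a list of local chunks, and the all-to-all exchange is modeled by appending to shared bucket lists; a real implementation would use MPI collectives such as gather, broadcast, and all-to-all):

```python
import random
from bisect import bisect_right

def sample_sort(data, p, oversampling=10, rng=None):
    """Simulate sample sort's phases; `data` is a list of p local chunks."""
    rng = rng or random.Random(0)
    # Phase 1: each "processor" draws oversampling*(p-1) local samples
    samples = []
    for chunk in data:
        samples.extend(rng.choices(chunk, k=oversampling * (p - 1)))
    # Phase 2: gather, sort, and pick p-1 evenly spaced splitters
    samples.sort()
    step = len(samples) // p
    splitters = [samples[k * step] for k in range(1, p)]
    # Phases 3-4: route every element to the bucket its key falls in
    # (bucket j holds keys between splitter j-1 and splitter j)
    buckets = [[] for _ in range(p)]
    for chunk in data:
        for x in chunk:
            buckets[bisect_right(splitters, x)].append(x)
    # Phases 5-6: sort each bucket locally, then concatenate
    result = []
    for b in buckets:
        result.extend(sorted(b))
    return result, [len(b) for b in buckets]

rng = random.Random(1)
chunks = [[rng.randint(0, 9999) for _ in range(250)] for _ in range(4)]
out, sizes = sample_sort(chunks, p=4)
print(out == sorted(x for c in chunks for x in c))
# → True
```

The returned bucket sizes show how close the splitters came to the ideal n/p elements per processor.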

8.4.4 Comparative Analysis

The choice among these algorithms reveals a fundamental trade-off between generality and architectural specialization. Parallel merge sort is a general-purpose algorithm that performs well on shared-memory systems. In contrast, bitonic sort and sample sort are more specialized. Bitonic sort’s data-oblivious, regular structure is optimized for the synchronous, massively parallel nature of GPUs and custom hardware.7 Sample sort is tailored for distributed-memory clusters, with a design engineered to overcome high network latency and the need for load balancing across independent nodes.9 This demonstrates that in parallel computing, the optimal algorithm is a function of the target hardware.

Furthermore, this case study highlights a crucial difference between sequential and parallel design. While sequential sorting algorithms focus on minimizing comparisons, the primary challenge in scalable parallel sorting is often communication and load balancing. A simple parallel quicksort can fail if a poor pivot choice sends most of the data to one processor, creating a sequential bottleneck.10 Sample sort’s design acknowledges this problem; its sampling and splitter selection phases are an upfront cost paid to ensure a balanced distribution of work, thereby minimizing processor idle time and maximizing parallel efficiency.9
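The effect of oversampling on load balance can be seen in a quick experiment (an illustrative sketch; `splitters_from_samples` and `bucket_sizes` are hypothetical helper names, and the exact numbers depend on the random seed):

```python
import random
from bisect import bisect_right

def splitters_from_samples(data, p, oversampling, rng):
    """Draw oversampling*p random samples and keep every
    oversampling-th order statistic as a splitter."""
    samples = sorted(rng.choices(data, k=oversampling * p))
    return [samples[k * oversampling] for k in range(1, p)]

def bucket_sizes(data, splitters):
    """Count how many elements fall into each of the p buckets."""
    sizes = [0] * (len(splitters) + 1)
    for x in data:
        sizes[bisect_right(splitters, x)] += 1
    return sizes

rng = random.Random(42)
data = [rng.random() for _ in range(100_000)]
p = 16
for osf in (1, 8, 64):
    splitters = splitters_from_samples(data, p, osf, rng)
    worst = max(bucket_sizes(data, splitters))
    print(f"oversampling {osf:3d}: largest bucket {worst} (ideal {len(data) // p})")
```

As the oversampling factor grows, the largest bucket shrinks toward the ideal n/p, which is exactly the upfront cost-for-balance trade described above.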

| Feature | Parallel Merge Sort | Bitonic Sort | Sample Sort |
|---|---|---|---|
| Core Idea | Recursive “divide and conquer” where sub-problems are sorted in parallel and then merged. | A data-oblivious sorting network that builds and merges bitonic sequences. | A generalization of quicksort that uses multiple “splitters” to partition data into buckets for load balancing. |
| Parallelism Strategy | Task parallelism (Fork/Join). | Data parallelism (SIMD). | Data parallelism with coordinated phases. |
| Time Complexity (Ideal) | O(log n) with O(n) processors. | O(log² n) with O(n) processors. | O(log n) with O(n) processors. |
| Key Challenge | Efficiently parallelizing the merge step. | Higher total comparison count; input size must be a power of two. | Selecting good splitters to ensure load balance; high communication cost in the redistribution phase. |
| Best Suited Architecture | Shared-memory multi-core systems. | SIMD architectures, GPUs, FPGAs, and hardware sorting networks. | Distributed-memory clusters and supercomputers. |

Table 8.4.1: Comparative Analysis of Parallel Sorting Algorithms. This table summarizes the core concepts, performance characteristics, and ideal architectural contexts for Parallel Merge Sort, Bitonic Sort, and Sample Sort, illustrating the diverse strategies for tackling this fundamental problem.3

  1. Which parallel sorting algorithm has the best average case performance? - Stack Overflow, accessed October 6, 2025, https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance

  2. Parallel Merge Sort - San Jose State University, accessed October 6, 2025, https://www.sjsu.edu/people/robert.chun/courses/cs159/s3/T.pdf

  3. Parallel Merge Sort | Zaid Humayun’s Blog, accessed October 6, 2025, https://redixhumayun.github.io/systems/2023/12/29/parallel-merge-sort.html

  4. Overview Parallel Merge Sort, accessed October 6, 2025, https://stanford.edu/~rezab/classes/cme323/S16/notes/Lecture04/cme323_lec4.pdf

  5. Parallel Merge Sort Algorithm. Introduction | by Rachit Vasudeva …, accessed October 6, 2025, https://rachitvasudeva.medium.com/parallel-merge-sort-algorithm-e8175ab60e7

  6. What is Bitonic sort? - Educative.io, accessed October 6, 2025, https://www.educative.io/answers/what-is-bitonic-sort

  7. Bitonic sorter - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Bitonic_sorter

  8. Bitonic Sort - GeeksforGeeks, accessed October 6, 2025, https://www.geeksforgeeks.org/dsa/bitonic-sort/

  9. Samplesort - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Samplesort

  10. 12.7.4 Quicksort or Samplesort Algorithm - The Netlib, accessed October 6, 2025, https://www.netlib.org/utk/lsi/pcwLSI/text/node302.html