8.4 Parallel Sorting Algorithms

Sorting is a fundamental problem in computer science that serves as a critical component in many applications and as a canonical benchmark for evaluating the performance of parallel algorithms and architectures. The goal of parallel sorting is to reduce the time required to sort a large dataset by distributing the work across multiple processors. While sequential comparison-based sorting algorithms have a time complexity of O(n log n), a parallel algorithm with p processors can ideally achieve a time complexity of O((n log n)/p).1 This case study examines three distinct parallel sorting algorithms—Parallel Merge Sort, Bitonic Sort, and Sample Sort—each illustrating a different design strategy and highlighting the trade-offs between computation, communication, and architectural suitability.

8.4.1 Parallel Merge Sort

Merge sort’s inherent divide-and-conquer structure makes it a natural candidate for parallelization using the Fork/Join pattern.2 The algorithm can be parallelized at two key stages: the recursive subdivision and the merging process.

A direct parallelization of merge sort involves executing the two recursive calls in parallel. For a given array, the algorithm splits it into two halves. One processor sorts the left half while another simultaneously sorts the right half.3 This process continues until the sub-arrays are small enough for sequential sorting. The primary challenge is the parallel merge step, where two sorted sub-arrays must be combined into a single sorted array.

A simple sequential merge creates a bottleneck. In a more scalable parallel merge, each of the p available processors takes elements from one sorted sub-array (A) and uses binary search to find where each would fall in the other sub-array (B). An element’s final position in the merged output is the sum of its index in A and the number of elements of B that precede it. Because every output position can be computed independently, the merge step can be executed efficiently in parallel.4 A scalable implementation for distributed memory often has each of the p processors sort a local chunk of n/p elements, followed by a tree-based merging phase in which pairs of processors merge their sorted lists over successive rounds.5
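The rank computation described above can be sketched in a few lines of Python (a sequential simulation for clarity; `rank_merge` is a hypothetical name, and the two loops are exactly the parts a parallel implementation would run concurrently). Using `bisect_left` for one array and `bisect_right` for the other breaks ties so that equal keys never collide at the same output index:

```python
from bisect import bisect_left, bisect_right

def rank_merge(a, b):
    """Merge two sorted lists by computing each element's final rank.

    An element's output position is its index in its own list plus the
    number of elements that precede it in the other list (found by
    binary search). Every position is independent of the others, which
    is what a parallel merge exploits; here the loops run sequentially.
    """
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # bisect_left counts elements of b strictly less than x
        c[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # bisect_right counts elements of a less than or equal to y,
        # so equal keys from a and b land in distinct slots
        c[j + bisect_right(a, y)] = y
    return c

print(rank_merge([1, 3, 5, 7], [2, 3, 6, 8]))
# → [1, 2, 3, 3, 5, 6, 7, 8]
```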

Algorithm: ParallelMergeSort(A[1..n], p processors)
Input: Array A of n elements, p processors
Output: Sorted array A

function ParallelMerge(A, B, C, p)
    // Parallel merge of sorted arrays A and B into array C
    n_a = length(A)
    n_b = length(B)
    parallel for i = 1 to n_a using p/2 processors do
        // Rank of A[i] in B: number of elements of B less than A[i]
        pos_in_B = BinarySearchLeft(B, A[i])
        C[i + pos_in_B] = A[i]
    end parallel
    parallel for j = 1 to n_b using p/2 processors do
        // Rank of B[j] in A, counting elements <= B[j]; this
        // tie-break keeps equal keys from colliding in C
        pos_in_A = BinarySearchRight(A, B[j])
        C[j + pos_in_A] = B[j]
    end parallel

function ParallelMergeSortRecursive(A, start, end, threshold)
    if (end - start) <= threshold then
        SequentialSort(A, start, end)
        return
    end
    mid = floor((start + end) / 2)
    // Fork: sort two halves in parallel
    parallel
        spawn ParallelMergeSortRecursive(A, start, mid, threshold)
        spawn ParallelMergeSortRecursive(A, mid+1, end, threshold)
    end parallel
    // Join: merge sorted halves (into a temporary buffer, then copy back)
    ParallelMerge(A[start..mid], A[mid+1..end], A[start..end], p)

// Main algorithm
chunk_size = n / p
// Phase 1: Each processor sorts its local chunk
parallel for processor_id = 0 to p-1 do
    start = processor_id * chunk_size
    end = start + chunk_size - 1
    SequentialSort(A, start, end)
end parallel
// Phase 2: Tree-based parallel merge
for level = 1 to log(p) do
    step = 2^level
    parallel for i = 0 to p-1 step step do
        if i + step/2 < p then
            merge_start = i * chunk_size
            merge_mid = (i + step/2) * chunk_size
            merge_end = min((i + step) * chunk_size, n)
            ParallelMerge(A[merge_start..merge_mid-1],
                          A[merge_mid..merge_end-1],
                          A[merge_start..merge_end-1], p)
        end
    end parallel
end
return A

Time Complexity: O((n/p) log(n/p) + n log p) = O((n log n)/p + n log p)
Communication: O(log p) merge phases
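The Fork/Join structure of the recursive phase can be sketched with Python’s standard `concurrent.futures` module (a minimal illustration, not a tuned implementation: `fork_join_sort`, `THRESHOLD`, and `max_depth` are names chosen here, and CPython’s GIL means threads demonstrate the pattern rather than true CPU parallelism — a production version would use processes or a language with real shared-memory threads):

```python
import concurrent.futures
import random

THRESHOLD = 8  # sub-arrays at or below this size are sorted sequentially

def merge(left, right):
    """Sequential two-way merge (the join step)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def fork_join_sort(data, pool, depth=0, max_depth=2):
    """Fork the left half to another worker, sort the right half in the
    current worker, then join the two sorted halves with a merge."""
    if len(data) <= THRESHOLD or depth >= max_depth:
        return sorted(data)
    mid = len(data) // 2
    left = pool.submit(fork_join_sort, data[:mid], pool, depth + 1, max_depth)
    right = fork_join_sort(data[mid:], pool, depth + 1, max_depth)
    return merge(left.result(), right)  # join: wait for the forked half

random.seed(0)
values = [random.randint(0, 999) for _ in range(100)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    result = fork_join_sort(values, pool)
print(result == sorted(values))
# → True
```

Capping recursion with `max_depth` keeps the number of in-flight tasks below the pool size, which avoids the deadlock that unbounded recursive submission into a fixed thread pool can cause.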

8.4.2 Bitonic Sort

Bitonic sort is a parallel sorting algorithm based on the concept of a sorting network.6 A key feature of this algorithm is that the sequence of comparisons is fixed and does not depend on the input data, making it a data-oblivious algorithm. This property makes it well-suited for implementation in hardware and on highly parallel, synchronous architectures like GPUs.7

The algorithm operates in two main phases, requiring the input size n to be a power of two:

  1. Building a Bitonic Sequence: The algorithm first transforms the unsorted input into a bitonic sequence—a sequence that first monotonically increases and then monotonically decreases, or a circular shift thereof. This is done by recursively sorting smaller sub-arrays, half in ascending order and half in descending order, and then concatenating them.
  2. Bitonic Merging: The bitonic sequence is then sorted using a bitonic merge operation. This operation recursively compares and swaps elements at a distance of n/2, then n/4, and so on, until all elements are sorted. For a bitonic sequence a, comparing and swapping a[i] with a[i + n/2] for all i < n/2 results in two smaller bitonic sequences, where all elements in the first half are smaller than all elements in the second half. These two halves are then sorted recursively in parallel.
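The compare-and-swap property in step 2 can be checked directly in Python (a small demonstration; `compare_split` is a name chosen here for one bitonic merge step):

```python
def compare_split(seq):
    """One bitonic merge step: compare a[i] with a[i + n/2] and keep the
    smaller value in the first half. Applied to a bitonic sequence, this
    yields two bitonic halves where every element of the first half is
    <= every element of the second half."""
    a = list(seq)
    half = len(a) // 2
    for i in range(half):
        if a[i] > a[i + half]:
            a[i], a[i + half] = a[i + half], a[i]
    return a[:half], a[half:]

# 1 3 5 7 | 8 6 4 2 is bitonic: it rises, then falls
lo, hi = compare_split([1, 3, 5, 7, 8, 6, 4, 2])
print(lo, hi)
# → [1, 3, 4, 2] [8, 6, 5, 7]
```

Both halves are again bitonic (the second is a circular shift of one), and max(lo) ≤ min(hi), so each half can be merged recursively and independently.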

The parallel time complexity of bitonic sort is O(log² n) using O(n) processors.7 While it performs more total comparisons than algorithms like merge sort, its highly regular and data-independent communication pattern allows for efficient implementation on SIMD-style architectures where processors execute instructions in lock-step.8

Algorithm: BitonicSort(A[0..n-1], p processors)
Input: Array A of n elements (n must be a power of 2)
Output: Sorted array in ascending order

function BitonicMerge(A, low, count, direction)
    // direction: 1 for ascending, 0 for descending
    if count > 1 then
        k = count / 2
        // Compare and swap elements at distance k
        parallel for i = low to low + k - 1 do
            if (direction == 1 and A[i] > A[i+k]) or
               (direction == 0 and A[i] < A[i+k]) then
                swap(A[i], A[i+k])
            end
        end parallel
        // Recursively merge both halves
        parallel
            BitonicMerge(A, low, k, direction)
            BitonicMerge(A, low + k, k, direction)
        end parallel
    end

function BitonicSortRecursive(A, low, count, direction)
    if count > 1 then
        k = count / 2
        // Sort first half ascending and second half descending,
        // producing a bitonic sequence of length count
        parallel
            BitonicSortRecursive(A, low, k, 1)
            BitonicSortRecursive(A, low + k, k, 0)
        end parallel
        // Merge the resulting bitonic sequence
        BitonicMerge(A, low, count, direction)
    end

// Main execution
BitonicSortRecursive(A, 0, n, 1)
return A

Example execution for n=8:

  • Level 1: Create 4 bitonic sequences of size 2
  • Level 2: Merge into 2 bitonic sequences of size 4
  • Level 3: Final merge into 1 sorted sequence of size 8

Total depth: O(log² n) compare-exchange stages
Parallelism: all comparisons at the same distance are independent and can execute simultaneously
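The recursive structure can be written compactly in Python (a sequential simulation; on a GPU, all compare-swaps at a given distance would run in lock-step rather than in a loop):

```python
def bitonic_merge(a, ascending):
    """Sort a bitonic sequence: compare-split at distance n/2,
    then recursively merge each (still bitonic) half."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    a = list(a)
    for i in range(half):
        # swap whenever the pair is out of order for this direction
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

def bitonic_sort(a, ascending=True):
    """Recursive bitonic sort; len(a) must be a power of two.
    Sorts one half up and the other down to build a bitonic
    sequence, then bitonic-merges it."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([5, 7, 1, 8, 2, 6, 4, 3]))
# → [1, 2, 3, 4, 5, 6, 7, 8]
```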

8.4.3 Sample Sort

Sample sort is a parallel sorting algorithm designed for distributed-memory systems. It is a generalization of quicksort that uses multiple pivots, called splitters, to partition the data into multiple buckets, addressing the load balancing challenge that affects simple parallel quicksort implementations.9

The algorithm proceeds in several distinct phases:9

  1. Sampling: Each of the p processors randomly selects a small number of samples from its local data. The number of samples is often chosen using an oversampling factor k, such that each processor selects k(p−1) samples.
  2. Splitter Selection: The samples from all processors are gathered to one processor (or all processors). These samples are sorted, and p−1 splitters are chosen from the sorted list at regular intervals. These splitters are intended to represent the overall distribution of the input data.
  3. Data Partitioning (Bucketing): The p−1 splitters are broadcast to all processors. Each processor then uses these splitters to partition its local data into p buckets. Bucket j on processor i contains all local elements that fall between splitter j−1 and splitter j.
  4. Data Redistribution: A global all-to-all communication phase occurs. Each processor i sends its j-th bucket to processor j. After this step, processor j holds all elements from the entire dataset that belong in the j-th bucket.
  5. Local Sort: Each processor sorts the data in its final bucket using a fast sequential sorting algorithm.
  6. Concatenation: Because the splitters divide the entire key range, the sorted data across the processors is now globally sorted. The concatenation of the sorted buckets from processor 0 to p−1 forms the final sorted array.

The success of sample sort depends on the quality of the splitters. By oversampling, the algorithm increases the probability that the splitters will partition the data into nearly equal-sized buckets, thus ensuring a balanced workload in the final local sort phase.9 This upfront investment in sampling and communication to achieve load balance makes sample sort highly scalable on distributed systems.

Algorithm: SampleSort(A[1..n], p processors)
Input: Array A distributed across p processors
Output: Globally sorted array

oversampling_factor = 10
samples_per_processor = oversampling_factor * (p - 1)
// Phase 1: Each processor samples its local data
parallel for each processor i = 0 to p-1 do
    local_samples[i] = randomly_select(samples_per_processor, A[i])
end parallel
// Phase 2: Select p-1 evenly spaced splitters from the sorted samples
// (p * samples_per_processor samples in total, so an interval of
// samples_per_processor divides them into p equal groups)
all_samples = gather_and_sort(local_samples)
for k = 1 to p-1 do
    splitters[k] = all_samples[k * samples_per_processor]
end
broadcast(splitters)
// Phase 3: Partition local data into buckets
parallel for each processor i = 0 to p-1 do
    buckets[i][0..p-1] = empty
    for each element in A[i] do
        bucket_id = BinarySearch(splitters, element)
        buckets[i][bucket_id].append(element)
    end
end parallel
// Phase 4: Exchange buckets (all-to-all communication)
parallel for each processor i = 0 to p-1 do
    for j = 0 to p-1 do
        send(buckets[i][j], to_processor=j)
    end
    new_local_data[i] = receive_from_all_processors()
end parallel
// Phase 5: Each processor sorts its received data
parallel for each processor i = 0 to p-1 do
    SequentialSort(new_local_data[i])
end parallel
// Phase 6: Concatenating the sorted buckets yields the sorted array
return concatenate(new_local_data[0], ..., new_local_data[p-1])

Time Complexity:

  • Sampling: O(samples per processor)
  • Local sort: O((n/p) log(n/p))
  • All-to-all: O(n/p) communication
  • Total: O((n/p) log(n/p) + n/p)

Load Balance: Depends on quality of splitters (improved by oversampling)
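The six phases can be simulated on a single machine in Python (a sketch under stated assumptions: `sample_sort` and its parameter names are chosen here, the "processors" are just a list of local chunks, and the all-to-all exchange is modeled by appending to shared bucket lists; a real implementation would use MPI collectives such as gather, broadcast, and all-to-all):

```python
import random
from bisect import bisect_right

def sample_sort(data, p, oversampling=10, rng=None):
    """Simulate sample sort's phases; `data` is a list of p local chunks."""
    rng = rng or random.Random(0)
    # Phase 1: each "processor" draws oversampling*(p-1) local samples
    samples = []
    for chunk in data:
        samples.extend(rng.choices(chunk, k=oversampling * (p - 1)))
    # Phase 2: gather, sort, and pick p-1 evenly spaced splitters
    samples.sort()
    step = len(samples) // p
    splitters = [samples[k * step] for k in range(1, p)]
    # Phases 3-4: route every element to the bucket its key falls in
    # (bucket j holds keys between splitter j-1 and splitter j)
    buckets = [[] for _ in range(p)]
    for chunk in data:
        for x in chunk:
            buckets[bisect_right(splitters, x)].append(x)
    # Phases 5-6: sort each bucket locally, then concatenate
    result = []
    for b in buckets:
        result.extend(sorted(b))
    return result, [len(b) for b in buckets]

rng = random.Random(1)
chunks = [[rng.randint(0, 9999) for _ in range(250)] for _ in range(4)]
out, sizes = sample_sort(chunks, p=4)
print(out == sorted(x for c in chunks for x in c))
# → True
```

The returned bucket sizes show how close the splitters came to the ideal n/p elements per processor.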

8.4.4 Comparative Analysis

The choice among these algorithms reveals a fundamental trade-off between generality and architectural specialization. Parallel merge sort is a general-purpose algorithm that performs well on shared-memory systems. In contrast, bitonic sort and sample sort are more specialized. Bitonic sort’s data-oblivious, regular structure is optimized for the synchronous, massively parallel nature of GPUs and custom hardware.7 Sample sort is tailored for distributed-memory clusters, with a design engineered to overcome high network latency and the need for load balancing across independent nodes.9 This demonstrates that in parallel computing, the optimal algorithm is a function of the target hardware.

Furthermore, this case study highlights a crucial difference between sequential and parallel design. While sequential sorting algorithms focus on minimizing comparisons, the primary challenge in scalable parallel sorting is often communication and load balancing. A simple parallel quicksort can fail if a poor pivot choice sends most of the data to one processor, creating a sequential bottleneck.10 Sample sort’s design acknowledges this problem; its sampling and splitter selection phases are an upfront cost paid to ensure a balanced distribution of work, thereby minimizing processor idle time and maximizing parallel efficiency.9
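The effect of oversampling on load balance can be seen in a quick experiment (an illustrative sketch; `splitters_from_samples` and `bucket_sizes` are hypothetical helper names, and the exact numbers depend on the random seed):

```python
import random
from bisect import bisect_right

def splitters_from_samples(data, p, oversampling, rng):
    """Draw oversampling*p random samples and keep every
    oversampling-th order statistic as a splitter."""
    samples = sorted(rng.choices(data, k=oversampling * p))
    return [samples[k * oversampling] for k in range(1, p)]

def bucket_sizes(data, splitters):
    """Count how many elements fall into each of the p buckets."""
    sizes = [0] * (len(splitters) + 1)
    for x in data:
        sizes[bisect_right(splitters, x)] += 1
    return sizes

rng = random.Random(42)
data = [rng.random() for _ in range(100_000)]
p = 16
for osf in (1, 8, 64):
    splitters = splitters_from_samples(data, p, osf, rng)
    worst = max(bucket_sizes(data, splitters))
    print(f"oversampling {osf:3d}: largest bucket {worst} (ideal {len(data) // p})")
```

As the oversampling factor grows, the largest bucket shrinks toward the ideal n/p, which is exactly the upfront cost-for-balance trade described above.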

| Feature | Parallel Merge Sort | Bitonic Sort | Sample Sort |
|---|---|---|---|
| Core Idea | Recursive “divide and conquer” where sub-problems are sorted in parallel and then merged. | A data-oblivious sorting network that builds and merges bitonic sequences. | A generalization of quicksort that uses multiple “splitters” to partition data into buckets for load balancing. |
| Parallelism Strategy | Task parallelism (Fork/Join). | Data parallelism (SIMD). | Data parallelism with coordinated phases. |
| Time Complexity (Ideal) | O(log n) with O(n) processors. | O(log² n) with O(n) processors. | O(log n) with O(n) processors. |
| Key Challenge | Efficiently parallelizing the merge step. | Higher total comparison count; input size must be a power of two. | Selecting good splitters to ensure load balance; high communication cost in the redistribution phase. |
| Best Suited Architecture | Shared-memory multi-core systems. | SIMD architectures, GPUs, FPGAs, and hardware sorting networks. | Distributed-memory clusters and supercomputers. |

Table 8.4.1: Comparative Analysis of Parallel Sorting Algorithms. This table summarizes the core concepts, performance characteristics, and ideal architectural contexts for Parallel Merge Sort, Bitonic Sort, and Sample Sort, illustrating the diverse strategies for tackling this fundamental problem.3

  1. Which parallel sorting algorithm has the best average case performance? - Stack Overflow, accessed October 6, 2025, https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance

  2. Parallel Merge Sort - San Jose State University, accessed October 6, 2025, https://www.sjsu.edu/people/robert.chun/courses/cs159/s3/T.pdf

  3. Parallel Merge Sort | Zaid Humayun’s Blog, accessed October 6, 2025, https://redixhumayun.github.io/systems/2023/12/29/parallel-merge-sort.html

  4. Overview Parallel Merge Sort, accessed October 6, 2025, https://stanford.edu/~rezab/classes/cme323/S16/notes/Lecture04/cme323_lec4.pdf

  5. Parallel Merge Sort Algorithm. Introduction | by Rachit Vasudeva …, accessed October 6, 2025, https://rachitvasudeva.medium.com/parallel-merge-sort-algorithm-e8175ab60e7

  6. What is Bitonic sort? - Educative.io, accessed October 6, 2025, https://www.educative.io/answers/what-is-bitonic-sort

  7. Bitonic sorter - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Bitonic_sorter

  8. Bitonic Sort - GeeksforGeeks, accessed October 6, 2025, https://www.geeksforgeeks.org/dsa/bitonic-sort/

  9. Samplesort - Wikipedia, accessed October 6, 2025, https://en.wikipedia.org/wiki/Samplesort

  10. 12.7.4 Quicksort or Samplesort Algorithm - The Netlib, accessed October 6, 2025, https://www.netlib.org/utk/lsi/pcwLSI/text/node302.html