5.2 Parallelism in Storage
The evolution of parallelism in persistent storage mirrors the journey of main memory, but with its own unique set of challenges rooted in the physics of data preservation and retrieval. The narrative begins in an era dominated by mechanical devices, where physical motion was the primary performance bottleneck, necessitating the invention of system-level parallelism to aggregate the performance of multiple drives. The advent of solid-state technology eliminated the mechanical constraints, but in doing so, it revealed a deep, multi-layered internal parallelism that the legacy protocols of the mechanical age were ill-equipped to handle. This catalyzed a complete reinvention of the storage interface, culminating in a modern architecture where parallelism extends from the host CPU, through the protocol, and deep into the silicon of the storage device itself.
5.2.1 The Mechanical Age: The Tyranny of Latency in Hard Disk Drives (HDDs)
For nearly half a century, the primary medium for persistent data storage was the Hard Disk Drive (HDD). The performance of these devices was fundamentally dictated not by electronics, but by mechanics. An HDD stores data as magnetic patterns on spinning platters, and accessing that data requires a physical act: a read/write head, mounted on an actuator arm, must be moved to the correct concentric circle (the track), and then must wait for the platter to rotate until the desired sector is underneath it.1 This process introduced two significant sources of latency, measured in milliseconds (ms), an eternity for processors that operate on nanosecond timescales:
- Seek Time: The time required for the actuator arm to move the head to the correct track.
- Rotational Latency: The time spent waiting for the platter to spin until the desired sector reaches the head. For a 7,200 RPM drive, the average rotational latency is about 4.17 ms, half of an 8.33 ms revolution, as the short calculation below illustrates.2
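A quick back-of-the-envelope calculation makes the scale of these delays concrete. The sketch below is illustrative only: the assumed seek time and transfer time are ballpark figures for a 7,200 RPM drive, not measurements of any specific device.

```python
# Back-of-the-envelope model of single-HDD random-read latency.
# The seek time and transfer time below are illustrative assumptions.

RPM = 7200
avg_seek_ms = 8.5      # assumed average seek time for a 7,200 RPM drive
transfer_ms = 0.1      # assumed time to read one small block once it is under the head

# Average rotational latency = half of one full revolution.
revolution_ms = 60_000 / RPM            # 8.33 ms per revolution
avg_rotational_ms = revolution_ms / 2   # ~4.17 ms

avg_access_ms = avg_seek_ms + avg_rotational_ms + transfer_ms
random_iops = 1000 / avg_access_ms

print(f"Avg rotational latency: {avg_rotational_ms:.2f} ms")
print(f"Avg random access time: {avg_access_ms:.2f} ms")
print(f"Approx. random IOPS (one spindle): {random_iops:.0f}")
# -> roughly 4.17 ms rotational, ~12.8 ms per access, i.e. well under 100 random IOPS
```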
These mechanical latencies created a massive “I/O wall,” a performance bottleneck orders of magnitude more severe than the memory wall.1 A single HDD could only service one request at a time, and each random request incurred these substantial mechanical delays. Sequential performance was significantly better, as the head could read contiguous sectors without seeking, but random I/O workloads, common in databases and operating systems, were cripplingly slow.

The physical constraints of early HDDs were equally daunting. The first commercial HDD, the IBM 350 RAMAC, introduced in 1956, weighed over a ton, filled a room, and stored a mere 5 megabytes of data on fifty 24-inch platters.3 While density improved dramatically over the decades, the fundamental mechanical principles remained. A drive could only spin so fast before the platters would warp, and the actuator arm’s precision was limited by vibration and mechanical tolerances.1

This inherent limitation, that a single drive was fundamentally constrained by physics and could only perform one mechanical action at a time, framed the problem that the next generation of storage solutions had to solve. If a single drive could not be made significantly faster for random workloads, the only viable path to higher performance and reliability was to harness the power of multiple drives working in parallel.
5.2.2 System-Level Parallelism: Redundant Array of Independent Disks (RAID)
In the late 1980s, as the inadequacy of single-drive reliability and performance became a critical issue for large-scale data systems, a groundbreaking concept emerged from a paper at the University of California, Berkeley: the Redundant Array of Inexpensive Disks (RAID).4 RAID was a system-level solution that combined multiple physical HDDs into a single logical unit, using parallelism to achieve levels of performance and fault tolerance that were impossible with any individual drive.5 RAID introduced several core techniques that could be combined in different ways to create various “RAID levels,” each offering a unique trade-off between performance, redundancy, and cost.6 The fundamental techniques are:
- Striping (RAID 0): Data is broken down into blocks, or “stripes,” which are then written sequentially across all the drives in the array. When the host system requests a large file, the read operation can occur in parallel across all drives, theoretically multiplying the sequential throughput by the number of drives in the array (N). This configuration offers the highest performance but provides no fault tolerance; the failure of a single drive results in the loss of all data in the array.5
- Mirroring (RAID 1): Data is duplicated exactly on two or more disks. Every write operation is performed on all disks in the mirror set simultaneously. This provides excellent redundancy, as the array can survive the failure of all but one drive. While write performance is limited to that of a single drive (or slightly less due to overhead), random read performance can be significantly improved, as a read request can be serviced by any drive in the set, allowing for parallel servicing of multiple read requests.5
- Parity (RAID 5 and RAID 6): Instead of storing a full duplicate copy of the data, parity-based RAID uses a mathematical checksum to provide redundancy. In RAID 5, data is striped across the drives, and a single parity block is calculated and stored for each stripe. This parity information is distributed across all drives in the array. If one drive fails, its data can be reconstructed on-the-fly by performing an XOR operation on the data from the surviving drives and the corresponding parity block, as the sketch after this list illustrates.5 RAID 6 extends this concept by calculating and storing two independent parity blocks, allowing it to withstand the failure of any two drives.7
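To make the parity mechanism concrete, the following sketch stripes a few blocks of data, computes an XOR parity block, and then rebuilds a “failed” block from the survivors. It is a minimal illustration with invented 4-byte blocks, not an implementation of any real RAID controller.

```python
# Minimal sketch of RAID 5-style parity: stripe data across N-1 data drives,
# compute one XOR parity block, then reconstruct a "failed" drive's block
# from the survivors. Block sizes and drive count are illustrative assumptions.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Three data blocks destined for three data drives (stripe unit = 4 bytes here).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# Parity block stored on the fourth drive of this stripe.
parity = xor_blocks(data_blocks)

# Simulate losing drive 1: rebuild its block from the survivors plus parity.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)

assert rebuilt == data_blocks[1]
print("Reconstructed block:", rebuilt)  # b'BBBB'
```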
Parity-based RAID levels introduce a significant write penalty. A simple logical write from the host requires a more complex “read-modify-write” sequence at the physical level. The RAID controller must read the old data block, read the old parity block, calculate the new parity, write the new data block, and finally write the new parity block. This can translate a single host write into four or more physical I/O operations, significantly impacting random write performance.8

These techniques were combined to create nested levels like RAID 10 (or 1+0), which creates a stripe across multiple mirrored sets, offering both the high read performance of striping and the redundancy of mirroring, making it a popular choice for high-performance databases.6 The following table summarizes the key characteristics and trade-offs of these standard RAID configurations.
Table_4_2: Performance and Redundancy Characteristics of Standard RAID Levels

| Feature | RAID 0 | RAID 1 | RAID 5 | RAID 6 | RAID 10 |
|---|---|---|---|---|---|
| Technique | Striping | Mirroring | Striping with Distributed Parity | Striping with Dual Parity | Stripe of Mirrors |
| Min. Drives | 2 | 2 | 3 | 4 | 4 |
| Fault Tolerance | None | 1 drive failure | 1 drive failure | 2 drive failures | Up to 1 drive per mirror |
| Read Performance | Excellent (N×X) | Good (N×X) | Excellent (N×X) | Excellent (N×X) | Excellent (N×X) |
| Write Performance | Excellent (N×X) | Poor (N×X/2) | Fair (N×X/4) | Poor (N×X/6) | Good (N×X/2) |
| Capacity Util. | 100% | 50% | (N−1)/N | (N−2)/N | 50% |
| Use Case | Video editing, scratch disks | OS boot drives, small databases | File servers, archives (legacy) | Large archives, backups | Databases, VMs |

Note: N = number of drives, X = performance of a single drive. Write performance penalties are approximations for random workloads.

Data sourced from.5
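The write-penalty figures in the table follow directly from the read-modify-write sequence described above. The sketch below is purely illustrative: the block contents, the 4-byte stripe unit, and the starting parity value are arbitrary assumptions, but it shows how the new parity is derived from the data delta and why one logical write costs four physical I/Os on RAID 5.

```python
# Sketch of the RAID 5 small-write ("read-modify-write") penalty.
# The controller updates one data block without touching the rest of the stripe.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = b"AAAA"              # block being overwritten (physical read #1)
old_parity = b"\x03\x03\x03\x03"  # current parity for the stripe (physical read #2)
new_data   = b"ZZZZ"

# New parity can be computed from the delta alone:
#   new_parity = old_parity XOR old_data XOR new_data
new_parity = xor(xor(old_parity, old_data), new_data)
print("New parity bytes:", new_parity.hex())

physical_ios = [
    "read old data block",    # 1
    "read old parity block",  # 2
    "write new data block",   # 3
    "write new parity block", # 4
]
print(f"1 logical write -> {len(physical_ios)} physical I/Os")
# This 4x amplification is why random write throughput for RAID 5 is
# approximated as N×X/4 in the table above.
```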
5.2.3 The Flash Era: Uncovering Deep Internal Parallelism in SSDs
The arrival of the Solid-State Drive (SSD) marked the most significant disruption in storage technology since the invention of the HDD. By replacing spinning magnetic platters and moving actuator arms with solid-state NAND flash memory, SSDs eliminated the mechanical latency that had defined storage performance for decades.1 This transition represented a fundamental shift from system-level parallelism, which was a workaround for slow devices, to device-level parallelism, which sought to exploit the inherently parallel nature of the new storage medium.

An SSD is not a monolithic block of memory. It is a highly parallel system in its own right, architecturally resembling a small, specialized distributed system. A typical SSD consists of a multi-core controller processor, a small amount of DRAM for caching metadata, and an array of NAND flash chips.9 The key to an SSD’s performance lies in the controller’s ability to access these multiple flash chips simultaneously. This is achieved through several layers of internal parallelism:
- Channel Parallelism: The SSD controller communicates with the NAND flash array via multiple independent data paths called channels. A modern enterprise SSD might have 8, 16, or even more channels. The controller can issue read, write, and erase commands to chips on different channels concurrently, allowing for true parallel I/O operations.9
- Way Parallelism (Chip Interleaving): Within a single channel, multiple flash chips can be connected. The controller can interleave operations among these chips, a technique analogous to memory interleaving. While one chip is busy with a long-latency operation (like programming a page), the controller can use the channel to transfer data to or from another chip, hiding latency and maximizing channel utilization.9
- Plane Parallelism: The parallelism extends even deeper, down to the level of a single flash die. A die is often partitioned into two or more planes, each with its own data register and page buffer. This allows the die to perform multiple operations concurrently. For example, data can be transferred from the host into the register of one plane while another plane is simultaneously performing the slow process of programming data from its register into the flash cells. This multi-plane operation effectively doubles or quadruples the performance of a single die.9
Orchestrating this complex, multi-layered parallelism is the responsibility of the SSD’s onboard firmware, known as the Flash Translation Layer (FTL). The FTL is a sophisticated piece of software that runs on the SSD’s controller. It has several critical responsibilities: it translates logical block addresses (LBAs) from the host operating system into physical page addresses distributed across the SSD’s channels, ways, and planes; it manages garbage collection to reclaim invalid pages; and it performs wear-leveling to distribute writes evenly across the flash cells to maximize the drive’s endurance.9

The effectiveness of this internal parallelism is not automatic. Research presented at leading storage conferences like USENIX FAST has shown that fully exploiting an SSD’s potential is a non-trivial challenge. The FTL must intelligently place data to enable parallel access later. Poor data placement, often caused by file fragmentation or interleaved writes from different processes, can lead to subsequent read requests targeting dies that are on the same channel or chips that are simultaneously busy. These “die-level collisions” serialize what could have been parallel operations, severely degrading performance and effectively undermining the SSD’s architectural advantages.10 This research highlights that the SSD’s performance is not just a function of its hardware, but also of the intelligence of its FTL and the nature of the I/O patterns it receives from the host.11
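One way to picture how logical pages spread across channels, ways, and planes is the simple static striping rule sketched below. This is a hedged illustration only: the geometry constants and the map_logical_page function are invented for this example, and production FTLs use dynamic, table-driven mappings rather than fixed arithmetic. The sketch does, however, show why consecutive logical pages can be serviced in parallel while certain access strides collide on a single channel.

```python
# Illustrative static mapping of logical page numbers onto an SSD's
# internal parallel units (channels -> ways/chips -> planes).
# Real FTLs maintain dynamic mapping tables; this sketch only shows how
# striping spreads consecutive logical pages across independent units.

NUM_CHANNELS = 8
WAYS_PER_CHANNEL = 4
PLANES_PER_DIE = 2

def map_logical_page(lpn: int):
    """Return (channel, way, plane, page) for a logical page number."""
    channel = lpn % NUM_CHANNELS
    way     = (lpn // NUM_CHANNELS) % WAYS_PER_CHANNEL
    plane   = (lpn // (NUM_CHANNELS * WAYS_PER_CHANNEL)) % PLANES_PER_DIE
    page    = lpn // (NUM_CHANNELS * WAYS_PER_CHANNEL * PLANES_PER_DIE)
    return channel, way, plane, page

# Eight consecutive logical pages land on eight different channels,
# so a sequential read of this range can proceed fully in parallel.
for lpn in range(8):
    print(lpn, map_logical_page(lpn))

# By contrast, pages 0, 8, 16, ... all map to channel 0: a workload that
# happens to touch only that stride serializes on one channel, which is
# the kind of "die-level collision" described above.
```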
5.2.4 The Protocol Revolution: From SATA’s Bottleneck to NVMe’s Ascendancy
The invention of the SSD was a monumental leap forward, eliminating the mechanical bottleneck of the HDD. However, in doing so, it immediately exposed the next major bottleneck in the storage stack: the communication protocol and interface. The legacy interfaces, designed in the era of slow, single-actuator hard drives, were fundamentally incapable of unleashing the massive internal parallelism of modern SSDs.

For years, the dominant interface for consumer and enterprise drives was Serial ATA (SATA), which used the Advanced Host Controller Interface (AHCI) protocol. AHCI was designed circa 2004 for HDDs and was built around a crucial, and ultimately fatal, limitation: it supports only a single command queue with a maximum depth of 32 outstanding commands.12 This single-threaded, serial-minded architecture was a reasonable match for an HDD that could only physically service one request at a time. For an SSD, however, with its multiple channels, dies, and planes all capable of operating in parallel, a single queue of 32 commands was a starvation-level diet of I/O requests. Modern multi-core CPUs could generate I/O requests far faster than the AHCI protocol could submit them to the drive, leaving both the CPU cores and the SSD’s internal parallel hardware sitting idle.12 The SATA interface itself, topping out at 600 MB/s for SATA III, also became a hard ceiling on throughput that even mid-range SSDs could easily saturate.12

The solution required a complete reinvention of the storage protocol, designed from the ground up for the unique characteristics of non-volatile memory. This solution was Non-Volatile Memory Express (NVMe). Introduced in 2011, NVMe was architected to fully exploit both the low latency of flash memory and the parallelism of modern multi-core processors.12 The architectural superiority of NVMe stems from two key design choices:
- Leveraging the PCIe Bus: Instead of using the SATA bus, which requires communication through a host bus adapter (HBA), NVMe utilizes the Peripheral Component Interconnect Express (PCIe) bus. This provides a direct, low-latency path from the storage device to the CPU, slashing protocol overhead and enabling vastly higher bandwidth. A single PCIe 4.0 lane offers more bandwidth than the entire SATA III interface, and typical NVMe SSDs use four lanes (x4) for a theoretical bandwidth of nearly 8 GB/s.12
- Massively Parallel Queuing: NVMe’s most revolutionary feature is its command and queueing mechanism. It replaces AHCI’s single queue with support for up to 65,535 command queues, with each queue capable of holding up to 65,536 commands. This architecture is a perfect match for modern multi-core systems. Each CPU core can be assigned its own queue(s) without lock contention, allowing it to submit I/O requests to the drive independently and in parallel. This massive increase in queueing capability finally provides a mechanism for the host system to generate enough concurrent requests to keep all the internal parallel resources of the SSD fully occupied.12 The short calculation after this list shows how much outstanding work that actually requires.
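Little’s Law (outstanding requests = throughput × latency) makes the queue-depth argument concrete. The latency and IOPS targets in the sketch below are assumptions chosen for illustration, not measurements of any particular drive, but they show why a single 32-entry AHCI queue cannot expose enough concurrency to saturate a flash device, while NVMe’s per-core queues easily can.

```python
# Little's Law: outstanding_commands = IOPS * latency.
# Rough illustration of why a single 32-entry AHCI queue starves a modern SSD
# while NVMe's many deep queues do not. Numbers are illustrative assumptions.

def required_queue_depth(target_iops: float, latency_s: float) -> float:
    """Outstanding commands needed to sustain target_iops at the given latency."""
    return target_iops * latency_s

ssd_latency = 100e-6  # assume 100 microseconds per random read

for target_iops in (100_000, 500_000, 1_000_000):
    qd = required_queue_depth(target_iops, ssd_latency)
    fits_in_ahci = qd <= 32
    print(f"{target_iops:>9,} IOPS needs ~{qd:>5.0f} outstanding commands "
          f"(fits in one 32-entry AHCI queue: {fits_in_ahci})")

# 100k IOPS already needs ~10 commands in flight; 1M IOPS needs ~100, which a
# single 32-entry AHCI queue cannot hold, but which NVMe spreads trivially
# across per-core queues of up to 65,536 entries each.
```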
The performance gap created by these architectural differences is immense. NVMe slashes latency, multiplies throughput, and dramatically increases the number of Input/Output Operations Per Second (IOPS) a drive can handle. The following table provides a stark comparison of the two protocols, crystallizing the transition from a serial legacy to a parallel-native future.
Table_4_3: Protocol Architecture and Performance: SATA/AHCI vs. NVMe

| Feature | SATA / AHCI | NVMe | Performance Implication |
|---|---|---|---|
| Physical Interface | SATA III (6 Gb/s) | PCIe (Gen4 x4: ~64 Gb/s) | NVMe has >10x the raw interface bandwidth. |
| Protocol | AHCI (Designed for HDDs) | NVMe (Designed for Flash) | NVMe is streamlined with lower overhead. |
| Command Queues | 1 | Up to 65,535 | NVMe enables massive parallelism from multi-core CPUs. |
| Queue Depth | 32 commands | Up to 65,536 per queue | Eliminates queueing as a bottleneck for I/O requests. |
| Host Communication | Via SATA Controller | Direct to CPU via PCIe | Lower latency and reduced CPU cycles per I/O. |
| Typical Latency | ~100-500 µs | <50 µs | Dramatically improved system responsiveness. |
| Typical Throughput | ~550 MB/s | 3,500 - 7,000+ MB/s | Order-of-magnitude increase in sequential performance. |
| Typical IOPS | ~100K | 500K - 1M+ | Unlocks performance for random I/O workloads. |

Data sourced from.2
5.2.5 Case Study: Transforming the Data Center with NVMe
The profound impact of the architectural shift from SATA to NVMe is most evident in the modern data center, where the performance of storage directly dictates the efficiency and capability of business-critical applications. Even after the transition from HDDs to SATA-based SSDs, many data-intensive workloads—such as large-scale Online Transaction Processing (OLTP) databases, big data analytics frameworks like Apache Spark, and AI/ML training pipelines—remained fundamentally bottlenecked by storage I/O.13 The single-queue nature of SATA/AHCI created a traffic jam between powerful multi-core servers and the parallel hardware inside the SSDs, preventing applications from reaching their full potential.

The adoption of NVMe SSDs has shattered this bottleneck, leading to transformative performance gains across the enterprise. Real-world studies and benchmarks quantify this impact vividly:
- In database workloads, replacing enterprise-class SATA SSDs with NVMe SSDs has been shown to deliver up to 8 times higher client-side performance.14
- Transactional database performance, measured in transactions per second, can increase by more than 2x simply by migrating from SATA to NVMe storage on the same server hardware.15
- For mixed read/write workloads common in data centers, a single high-performance NVMe SSD can deliver over 1 million IOPS, a 10x improvement over the ~100K IOPS limit of the SATA interface.16
These dramatic improvements are a direct consequence of NVMe’s parallel-native architecture. In a modern data center server with dozens of CPU cores, NVMe’s support for thousands of deep command queues allows applications to issue a massive number of concurrent I/O requests. This ensures that the SSD’s internal channels are saturated with work, fully exploiting its device-level parallelism. The result is a significant reduction in application latencies, as CPUs and GPUs spend far less time waiting for data and more time performing computation.17 This is particularly critical for latency-sensitive industries like finance, healthcare, and telecommunications, where faster transaction processing and data analysis translate directly to business value.17

The evolution of storage parallelism has now entered its next logical phase, extending beyond the individual server to the entire data center fabric. NVMe over Fabrics (NVMe-oF) is a technology that extends the low-latency, high-parallelism NVMe command set across network fabrics like Ethernet or InfiniBand.15 This allows for the creation of disaggregated storage architectures, where a large pool of high-performance NVMe SSDs can be shared efficiently among many compute servers. NVMe-oF preserves the end-to-end parallelism of the NVMe protocol, enabling applications to access remote storage with latencies close to those of locally attached drives.18 This technology is a key enabler for the next generation of cloud, hyper-converged, and software-defined infrastructures, demonstrating that the principle of parallelism, which began with striping data across a few hard drives, now scales to orchestrate I/O across the entire data center.
References
Footnotes
1. Hard disk drive (HDD) versus Solid-state drive (SSD): What’s the difference? - IBM, accessed October 2, 2025, https://www.ibm.com/think/topics/hard-disk-drive-vs-solid-state-drive
2. HDD vs SATA SSD vs NVMe SSD Concepts - Advanced - Atlantic.Net, accessed October 2, 2025, https://www.atlantic.net/vps-hosting/hdd-vs-sata-ssd-vs-nvme-ssd-concepts/
3. Is the Hard Disk Drive Obsolete? Flash vs. HDDs | ESF - Enterprise Storage Forum, accessed October 2, 2025, https://www.enterprisestorageforum.com/hardware/storage-hardware-hdd-obsolete/
4. Hard-Disk Drives: The Good, the Bad, and the Ugly - Communications of the ACM, accessed October 2, 2025, https://cacm.acm.org/practice/hard-disk-drives-the-good-the-bad-and-the-ugly/
5. Standard RAID levels - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Standard_RAID_levels
6. RAID Level 0, 1, 5, 6, 10: Advantages, Disadvantages, and Uses, accessed October 2, 2025, https://www.liquidweb.com/blog/raid-level-1-5-6-10/
7. Understanding RAID Levels, Configurations & More - NI - National Instruments, accessed October 2, 2025, https://www.ni.com/en/shop/understanding-raid.html
8. Understanding RAID Performance at Various Levels | Arcserve, accessed October 2, 2025, https://www.arcserve.com/blog/understanding-raid-performance-various-levels
9. Analytical Model of SSD Parallelism - KAIST OS Lab, accessed October 2, 2025, https://oslab.kaist.ac.kr/wp-content/uploads/esos_files/publication/conferences/international/VSSIM_IOSimulator.pdf
10. Exploiting SSD Asymmetry and Concurrency for Storage-Intensive Applications - UMass Boston CS, accessed October 2, 2025, https://www.cs.umb.edu/~tpapon/pdfs/thesis_prospectus.pdf
11. Excessive SSD-Internal Parallelism Considered Harmful, accessed October 2, 2025, https://www.hotstorage.org/2023/papers/hotstorage23-final66.pdf
12. NVMe vs SATA: What is the difference? - Kingston Technology, accessed October 2, 2025, https://www.kingston.com/en/blog/pc-performance/nvme-vs-sata
13. Exploring Benefits of NVMe SSDs for BigData Processing in Enterprise Data Centers, accessed October 2, 2025, https://www.researchgate.net/publication/337501121_Exploring_Benefits_of_NVMe_SSDs_for_BigData_Processing_in_Enterprise_Data_Centers
14. Performance analysis of NVMe SSDs and their implication on real world databases, accessed October 2, 2025, https://www.researchgate.net/publication/300298716_Performance_analysis_of_NVMe_SSDs_and_their_implication_on_real_world_databases
15. Why Replace Enterprise SATA SSDs with Data Center NVMe™ SSDs? - KIOXIA America, Inc., accessed October 2, 2025, https://americas.kioxia.com/en-ca/business/resources/top-5-reasons/replace-enterprise-sata-with-data-center-ssds.html
16. Micron 9400 NVMe SSD the new leader for data center workloads, accessed October 2, 2025, https://www.micron.com/about/blog/applications/data-center/micron-9400-nvme-ssd-for-data-center-workloads
17. The Benefits of NVMe in Enterprise - Kingston Technology, accessed October 2, 2025, https://www.kingston.com/en/blog/servers-and-data-centers/the-benefits-of-nvme-in-enterprise
18. Boost Workload Performance with NVMe/TCP - Dell Technologies, accessed October 2, 2025, https://www.delltechnologies.com/asset/en-us/products/networking/industry-market/boost-workload-performance-with-nvmetcp.pdf