5.1 Parallelism in Main Memory (RAM)
The evolution of parallelism within main memory represents a critical and continuous front in the war against system bottlenecks. The fundamental challenge has always been the processor-memory performance gap—a chasm that has widened with each successive generation of CPU technology. This section delves into the architectural and technological solutions developed over decades to bridge this gap. The narrative begins with the formal definition of the “memory wall” and explores the foundational concepts that first introduced parallelism into the memory subsystem. It then charts the generational refinements of Double Data Rate (DDR) memory, which scaled bandwidth for over two decades, before culminating in the revolutionary paradigm shift of 3D-stacked High Bandwidth Memory (HBM), the technology that now powers the world’s most advanced supercomputers.
5.1.1 The “Memory Wall”: A Persistent Challenge
The “memory wall” is a term that has haunted computer architects for decades, encapsulating the fundamental and growing disparity between the rate of improvement in microprocessor performance and the much slower rate of improvement in Dynamic Random-Access Memory (DRAM) speed.1 Coined in a seminal 1994 paper by William Wulf and Sally McKee, the term gave a name to an alarming trend: processors were becoming faster at a much greater rate than the memory that supplied them with data, leading to a bottleneck where CPUs would increasingly sit idle, waiting for data to arrive.2

Historical data starkly illustrates this divergence. Between 1986 and 2000, CPU performance improved at an astonishing annual rate of 55%, while memory performance improved by a comparatively modest 10%.3 The advent of multi-core processors in the mid-2000s, a strategy to continue performance scaling after clock frequency scaling hit its own physical limits, only exacerbated the problem. A single chip now housed multiple powerful processing cores, all competing for access to the same shared memory subsystem, placing an even greater strain on memory bandwidth and latency.1 This challenge is not a single problem but a dual-faceted one, rooted in the two primary metrics of memory performance: latency and bandwidth.
- Latency is the time it takes to access a specific memory location and retrieve the first piece of data. It is governed by the physical characteristics of DRAM cells and the time required for row and column activation.
- Bandwidth is the rate at which data can be transferred after the initial access has been made. It is a function of the memory bus width and clock frequency.
While architectural innovations have been remarkably successful at mitigating the bandwidth problem, the latency issue has proven far more intractable. Across multiple generations of DDR SDRAM, from DDR-400 to DDR3-1600, the fundamental Column Access Strobe (CAS) latency—the time between a read command and the data output—has remained relatively constant at around 10–15 nanoseconds.4 CPU vendors have made incremental advancements to combat the bandwidth limitation, primarily by adding more memory channels to the CPU’s integrated memory controller and supporting successively faster generations of DDR memory.1 However, these have been temporary reliefs, akin to widening a highway to ease congestion without increasing the fundamental speed limit. The core issue of latency has remained a stubborn floor, a physical limitation that cannot be easily engineered away. This persistent latency floor is precisely why complex, multi-level cache hierarchies remain an indispensable component of modern processors. Caches are small, fast, and expensive SRAMs that store frequently accessed data physically closer to the processor, serving as a buffer against the high latency of main DRAM. The entire history of memory parallelism can therefore be understood as a series of sophisticated engineering strategies designed to work around this fundamental latency problem. Since the initial wait time could not be significantly reduced, architects focused relentlessly on maximizing the one variable they could control: the amount of data delivered per unit of time once the connection was established. This focus on maximizing throughput to hide inherent slowness is the driving force behind every innovation discussed in this section, from memory interleaving to High Bandwidth Memory.
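To make the latency/bandwidth distinction concrete, the short Python sketch below estimates how long it takes to deliver transfers of various sizes under a simple two-parameter model. The specific numbers (roughly 15 ns of fixed access latency and 25.6 GB/s of peak bandwidth, in line with the DDR-era figures discussed in this section) are illustrative assumptions, not measurements of any particular system.

```python
# Illustrative model: total transfer time = fixed access latency + size / peak bandwidth.
# Both constants below are assumptions chosen for illustration, not measured values.

ACCESS_LATENCY_NS = 15.0      # assumed first-word (CAS-style) latency, in nanoseconds
PEAK_BANDWIDTH_GBPS = 25.6    # assumed peak bandwidth, e.g. one DDR4-3200 channel, in GB/s

def transfer_time_ns(size_bytes: float) -> float:
    """Time to deliver size_bytes once, under the simple latency-plus-streaming model."""
    streaming_ns = size_bytes / (PEAK_BANDWIDTH_GBPS * 1e9) * 1e9   # bytes / (bytes per ns)
    return ACCESS_LATENCY_NS + streaming_ns

for size in (64, 4 * 1024, 1024 * 1024):   # a cache line, a page, one MiB
    total = transfer_time_ns(size)
    latency_share = ACCESS_LATENCY_NS / total * 100
    print(f"{size:>8} B: {total:10.1f} ns total, {latency_share:5.1f}% of it is fixed latency")
```

For a 64-byte cache line, the fixed latency accounts for the vast majority of the total time, while for megabyte-scale streaming transfers it all but disappears, which is why architects concentrated on raising throughput once the latency floor proved immovable.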
5.1.2 Foundational Concepts: From PRAM to Physical Architectures
Before physical hardware could be engineered to deliver parallel data streams, the theoretical underpinnings of parallel memory access had to be established. The most influential abstract model in this domain is the Parallel Random-Access Machine (PRAM), developed in the late 1970s.5 The PRAM model serves as the parallel computing analogue to the sequential Random-Access Machine (RAM) model. It provides a theoretical framework for designing and analyzing parallel algorithms by assuming an idealized shared-memory machine with an unbounded number of processors that can all access any memory location in a single time step, free from resource contention or communication overhead.5 The PRAM model is further refined into several variants based on how it handles simultaneous memory accesses, providing a formal language to describe the concurrency requirements of an algorithm:
- Exclusive Read, Exclusive Write (EREW): The most restrictive model, where each memory cell can only be read or written to by a single processor at any given time.
- Concurrent Read, Exclusive Write (CREW): A more powerful model that allows multiple processors to read from the same memory cell simultaneously, but only one processor can write to it at a time.
- Concurrent Read, Concurrent Write (CRCW): The most powerful and least restrictive model, allowing multiple processors to both read from and write to the same memory location simultaneously. Further rules are needed to resolve write conflicts (e.g., which processor’s value “wins”), as illustrated in the sketch after this list.5
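The write-conflict rules are easiest to see in executable form. The following minimal Python sketch, an illustration of the abstract model rather than of any real hardware, resolves one concurrent-write step to a single shared cell under two commonly described policies: “common” (all writers must agree) and “priority” (the lowest-numbered processor wins). The function name and policy labels are this sketch’s own conventions.

```python
# Simulate one concurrent-write step of the CRCW PRAM model on a single shared memory cell.
# `writes` maps processor id -> value that the processor attempts to write in this step.

def crcw_write(writes: dict[int, int], policy: str = "priority") -> int:
    """Resolve simultaneous writes to one cell under a CRCW conflict-resolution policy."""
    if not writes:
        raise ValueError("no processor is writing in this step")
    if policy == "common":
        # Common CRCW: concurrent writes are only legal if every processor writes the same value.
        values = set(writes.values())
        if len(values) != 1:
            raise RuntimeError("illegal step: processors attempted to write different values")
        return values.pop()
    if policy == "priority":
        # Priority CRCW: the processor with the lowest index wins the conflict.
        return writes[min(writes)]
    raise ValueError(f"unknown policy: {policy}")

# Processors 3, 1 and 7 all write to the same cell in the same time step.
attempts = {3: 42, 1: 99, 7: 42}
print(crcw_write(attempts, policy="priority"))   # -> 99, because processor 1 has priority
```

Under the EREW and CREW variants the same step would simply be disallowed; that is exactly the kind of constraint an algorithm designer must respect when targeting those models.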
While the PRAM model is an abstraction that cannot be perfectly implemented in physical hardware due to issues like memory bank contention and communication latency, it provides an essential theoretical basis for understanding and exploiting concurrency. The first practical attempts to implement these principles in real systems focused on hiding the very real latency that the PRAM model ignores. One such technique is Memory-Level Parallelism (MLP), a microarchitectural strategy designed to service multiple cache misses in parallel. In a traditional sequential processor, when a cache miss occurs, the processor stalls for a long period while data is fetched from main memory. With MLP, an out-of-order execution engine can continue to process other independent instructions, and if those instructions also result in cache misses, the memory controller can issue multiple memory requests simultaneously. By overlapping the long stalls of several misses, the effective memory access time is significantly reduced and a portion of the memory latency is hidden behind useful work.6

A more foundational and widely implemented architectural technique is Memory Interleaving. This design compensates for the relatively slow cycle time of individual DRAM banks by spreading contiguous memory addresses across multiple physical banks in a round-robin fashion.7 The most common form is low-order interleaving, where the low-order bits of a memory address select the memory bank and the high-order bits select the location within that bank. The mapping is typically defined by the formula Bank_Number = Memory_Address mod Number_of_Banks.8 For example, in a system with four-way interleaving:
- Address 0 maps to Bank 0
- Address 1 maps to Bank 1
- Address 2 maps to Bank 2
- Address 3 maps to Bank 3
- Address 4 maps to Bank 0
This arrangement is highly effective because programs exhibit strong spatial locality of reference, meaning they tend to access nearby memory locations in sequence. With an interleaved architecture, a sequence of memory reads or writes will be directed to different banks in succession. While one bank is busy servicing a request and entering its precharge/recovery cycle, the memory controller can initiate a new access to the next bank in the sequence. This creates a pipelined effect, allowing multiple memory operations to be in different stages of completion simultaneously across the banks, thereby hiding individual bank latency and maximizing overall memory throughput.7 This simple yet powerful form of parallelism became a cornerstone of modern memory systems and set the stage for the more complex techniques that would follow.
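The pipelined effect described above can be seen in a small timing model. The Python sketch below is a deliberately simplified simulation: the bank busy time of four cycles and the one-request-per-cycle controller are assumptions for illustration, not real DRAM timings, and addresses are steered to banks with the low-order rule Bank_Number = Memory_Address mod Number_of_Banks from above.

```python
# Toy timing model of low-order memory interleaving.
# Assumptions for illustration only: each bank is busy for BANK_CYCLE cycles per access
# (service plus precharge/recovery), and the controller issues at most one request per cycle.

BANK_CYCLE = 4    # assumed bank busy time, in cycles
NUM_BANKS = 4     # four-way interleaving

def total_cycles(addresses: list[int], num_banks: int) -> int:
    """Cycle on which the last access completes, for a stream of word addresses."""
    bank_free_at = [0] * num_banks    # earliest cycle at which each bank can accept a request
    issue_cycle = 0                   # next cycle on which the controller may issue a request
    finish = 0
    for addr in addresses:
        bank = addr % num_banks                       # low-order interleaving
        start = max(issue_cycle, bank_free_at[bank])  # wait for both the controller and the bank
        bank_free_at[bank] = start + BANK_CYCLE
        finish = max(finish, start + BANK_CYCLE)
        issue_cycle = start + 1
    return finish

sequential = list(range(16))                                          # 16 consecutive addresses
print("interleaved, 4 banks:", total_cycles(sequential, NUM_BANKS))   # banks overlap -> 19 cycles
print("single bank         :", total_cycles(sequential, 1))           # serialized   -> 64 cycles
```

With four banks the sequential stream completes at roughly one access per cycle once the pipeline is full; with a single bank every access must wait out the full recovery time of its predecessor.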
5.1.3 The DDR SDRAM Epoch: Scaling Bandwidth Through Generational Refinements
For over two decades, the dominant technology in the fight against the memory wall was Double Data Rate Synchronous Dynamic Random-Access Memory (DDR SDRAM). Its evolution represents a masterclass in incremental engineering, where a series of clever refinements to a core architecture delivered exponential gains in bandwidth generation after generation. The story begins with the crucial transition from asynchronous DRAM to Synchronous DRAM (SDRAM) in the 1990s.9 In asynchronous DRAM, every operation was paced directly by the timing of control signals from the memory controller, an approach that was complex to coordinate and inefficient at higher speeds. SDRAM revolutionized this by introducing a synchronous interface, tying all memory operations to a shared bus clock. This synchronization allowed the memory controller to pipeline commands; it could issue a new command before the previous one had finished executing, dramatically improving concurrency and effective data transfer rates.4

The next major breakthrough was the introduction of Double Data Rate (DDR) technology around the year 2000.10 The key innovation of DDR was its ability to transfer data on both the rising and falling edges of the clock signal. This simple yet profound change effectively doubled the data transfer rate of the memory bus without requiring an increase in the underlying clock frequency, which would have consumed more power and created signal integrity challenges.9 This doubling of the data rate was enabled by a crucial internal mechanism: the prefetch buffer. The internal DRAM array, composed of the actual memory cells, operates at a slower clock rate than the external I/O bus. The prefetch buffer acts as a small, high-speed staging buffer that fetches a block of data from the array in a single internal clock cycle and then serializes it out onto the faster external bus. The size of this buffer became the primary lever for generational improvement; the arithmetic linking prefetch depth, internal array clock, and external data rate is sketched after the list below:
- DDR SDRAM used a 2n-prefetch buffer, meaning it fetched 2 data words (e.g., 2×64 bits) from the memory array per internal clock cycle.
- DDR2 SDRAM doubled this to a 4n-prefetch buffer. This allowed the internal memory clock to run at half the speed of the I/O bus clock, reducing power consumption while still doubling the overall data rate compared to DDR.9
- DDR3 SDRAM doubled the buffer size again to 8n-prefetch, further increasing the ratio between the external data rate and the internal array speed.9
- DDR4 SDRAM maintained the 8n-prefetch architecture but introduced other improvements, such as bank groups, to improve efficiency and enable higher clock speeds.
- DDR5 SDRAM, the latest standard, doubles the prefetch again to 16n and splits each module into two independent 32-bit subchannels, which keeps each burst matched to a 64-byte cache line while allowing more requests in flight.
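As a rough illustration of how the prefetch depth links the slow internal array to the fast external bus, the Python sketch below computes nominal transfer rates and peak per-module bandwidth from an assumed internal array clock for each generation. The array clocks and part names are representative textbook values chosen for illustration, not specifications of particular devices.

```python
# Nominal relationships for DDR-family devices:
#   data_rate (MT/s)      = internal_array_clock (MHz) * prefetch_n
#   io_clock  (MHz)       = data_rate / 2          (data moves on both clock edges)
#   peak bandwidth (GB/s) = data_rate * bus_width_bytes / 1000

BUS_WIDTH_BYTES = 8   # a standard 64-bit DIMM interface

# (generation, assumed internal array clock in MHz, prefetch depth)
generations = [
    ("DDR-400",   200,  2),
    ("DDR2-800",  200,  4),
    ("DDR3-1600", 200,  8),
    ("DDR4-3200", 400,  8),
    ("DDR5-6400", 400, 16),
]

for name, array_mhz, prefetch in generations:
    data_rate = array_mhz * prefetch                   # MT/s on the external bus
    io_clock = data_rate / 2                           # MHz, thanks to double data rate
    peak_gbs = data_rate * BUS_WIDTH_BYTES / 1000      # GB/s for one 64-bit module
    print(f"{name:10s} array {array_mhz:4d} MHz  I/O {io_clock:6.0f} MHz  "
          f"{data_rate:5d} MT/s  ~{peak_gbs:5.1f} GB/s")
```

Notice that the assumed array clock barely moves across generations; nearly all of the headline gains come from deeper prefetch and faster I/O signaling, which is exactly the pattern visible in Table 4.1 below.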
This relentless cycle of doubling the prefetch depth, increasing clock frequencies, and reducing operating voltage allowed DDR SDRAM to scale its bandwidth exponentially over two decades. Each generation offered significantly higher performance and greater density while consuming less power, providing a reliable and cost-effective solution for the entire computing industry, from mobile phones to servers. The following table summarizes this remarkable evolutionary path.

Table 4.1: Generational Comparison of DDR SDRAM Standards (DDR1–DDR5)
| Generation | Year Introduced | Data Rate (MT/s) | I/O Clock (MHz) | Prefetch Buffer | Voltage | Typical Module Size |
|---|---|---|---|---|---|---|
| DDR (DDR1) | ~2000 | 200–400 | 100–200 | 2n | 2.5 V | Up to 1 GB |
| DDR2 | ~2003 | 400–1066 | 200–533 | 4n | 1.8 V | Up to 4 GB |
| DDR3 | ~2007 | 800–2133 | 400–1066 | 8n | 1.5 V | Up to 16 GB |
| DDR4 | ~2014 | 2133–3200+ | 1066–1600 | 8n | 1.2 V | Up to 64 GB |
| DDR5 | ~2020 | 4800–8400+ | 2400–4200+ | 16n | 1.1 V | Up to 128 GB+ |

Data sourced from.9
5.1.4 The 3D Revolution: High Bandwidth Memory (HBM)
While the DDR SDRAM epoch was characterized by steady, evolutionary progress, the relentless demands of high-performance computing (HPC) and artificial intelligence (AI) eventually pushed the traditional planar memory architecture to its physical limits. The strategy of making the memory bus progressively faster hit a wall of diminishing returns, where escalating clock frequencies led to prohibitive power consumption and complex signal integrity issues over the physical distances of a motherboard’s printed circuit board (PCB). This challenge necessitated a revolutionary paradigm shift, moving from a two-dimensional to a three-dimensional architecture: High Bandwidth Memory (HBM).11

HBM represents a fundamental rethinking of how memory and processors should be connected. Instead of pursuing higher clock speeds over a narrow bus, HBM achieves its massive performance gains by using an extremely wide, yet slower and shorter, communication interface. This is made possible by three core architectural principles, with a short bandwidth comparison sketched after the list:
- Stacked DRAM and Through-Silicon Vias (TSVs): The defining feature of HBM is its 3D-stacked design. Multiple DRAM dies (typically 4, 8, or even 12) are stacked vertically on top of a base logic die. These layers are interconnected by thousands of microscopic vertical conductive channels called Through-Silicon Vias (TSVs), which pass directly through the silicon of each die.11 This vertical integration allows for a density and level of interconnection that is impossible in a 2D layout.
- Ultra-Wide Memory Interface: The dense vertical wiring enabled by TSVs allows for an extraordinarily wide memory bus. A single HBM2 stack, for example, features a 1024-bit wide interface, composed of eight independent 128-bit channels.12 This is a stark contrast to a standard DDR4 or DDR5 DIMM, which has a 64-bit wide interface. With a bus that is 16 times wider, HBM can transfer an immense amount of data at a much lower clock frequency, achieving unprecedented bandwidth. A single HBM2 stack can deliver between 256 and 307 GB/s of bandwidth, far exceeding what even a multi-channel DDR system can provide.11
- Silicon Interposer and 2.5D Packaging: Connecting a 1024-bit wide bus from a memory stack to a processor on a conventional PCB is physically impractical due to the sheer number of traces required. HBM solves this with 2.5D packaging. The processor die and one or more HBM stacks are placed side-by-side on a single piece of silicon called an interposer. This interposer contains extremely fine, high-density wiring that connects the processor’s memory controller directly to the HBM stacks over a very short distance.11 This co-packaging minimizes signal travel distance, which in turn reduces latency, power consumption, and signal degradation.
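The arithmetic behind the “wide and slow beats narrow and fast” argument is simple enough to show directly. The Python sketch below compares one HBM2 stack against one DDR4-3200 channel using nominal per-pin data rates; the HBM2 per-pin rates of 2.0 and 2.4 Gb/s are the commonly quoted figures behind the 256–307 GB/s range cited above, and all numbers should be read as nominal peaks rather than measured throughput.

```python
# Peak bandwidth = interface width (bits) * per-pin data rate (Gb/s) / 8  ->  GB/s

def peak_bandwidth_gbs(width_bits: int, per_pin_gbps: float) -> float:
    """Nominal peak bandwidth of a memory interface, in GB/s."""
    return width_bits * per_pin_gbps / 8

# One HBM2 stack: 1024-bit interface (eight 128-bit channels) at a modest per-pin rate.
print("HBM2 stack @ 2.0 Gb/s per pin:", peak_bandwidth_gbs(1024, 2.0), "GB/s")   # 256.0
print("HBM2 stack @ 2.4 Gb/s per pin:", peak_bandwidth_gbs(1024, 2.4), "GB/s")   # 307.2

# One DDR4-3200 channel: 64-bit interface at a much higher per-pin rate.
print("DDR4-3200 channel            :", peak_bandwidth_gbs(64, 3.2), "GB/s")     # 25.6
```

Each HBM pin runs slower than a DDR4 pin, yet the interface that is sixteen times wider delivers roughly an order of magnitude more bandwidth per stack; the silicon interposer is what makes routing those 1,024 signals physically feasible.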
The development of HBM was not merely an effort to create “faster memory”; it was a systemic architectural response to the fundamental physical constraints of power and signal integrity. The decades-long strategy of increasing data rates on the DDR bus had become unsustainable from a power-efficiency perspective, as power consumption scales with frequency and distance. HBM’s design philosophy is a direct inversion of this approach. By making the data path enormously wider and dramatically shorter, it can afford to make it slower in terms of clock frequency. This trade-off is profoundly advantageous: moving data at a lower frequency over the millimeters of silicon within an interposer is vastly more energy-efficient than pushing it at multi-gigahertz frequencies over the centimeters of copper on a PCB.11 This paradigm shift has positioned HBM as a specialized, premium memory solution for the most bandwidth-hungry applications. While DDR remains the workhorse for general-purpose computing and GDDR is optimized for the cost-performance balance required by consumer graphics cards, HBM is the undisputed choice for extreme-bandwidth scenarios such as high-end GPUs, AI accelerators, and the world’s fastest supercomputers.11 HBM’s emergence signifies a critical inflection point in computer architecture, where physical proximity and massive parallelism became more valuable than raw clock speed in the quest to tear down the memory wall.
5.1.5 Case Study: HBM in Exascale Supercomputing
The transformative impact of High Bandwidth Memory is most vividly demonstrated in the realm of exascale supercomputing, where solving the world’s most complex scientific problems is often limited more by the ability to move data than by the ability to perform calculations. Two of the world’s leading supercomputers, Fugaku and Aurora, exemplify the critical role HBM plays in achieving unprecedented performance on memory-intensive workloads.
Fugaku: Maximizing Bandwidth for Scientific Discovery
The Fugaku supercomputer, located at the RIKEN Center for Computational Science in Japan, was designed with an “application-first” co-design philosophy, prioritizing performance on real-world scientific applications over synthetic benchmarks.13 At the heart of Fugaku is the Fujitsu A64FX processor, a custom Arm-based CPU engineered specifically for HPC. A key feature of the A64FX is its tight integration of memory: each processor is packaged with 32 GiB of HBM2, providing a staggering 1024 GB/s of memory bandwidth per node.14 This is more than an order of magnitude higher than the bandwidth available to typical server CPUs of its era.

This massive bandwidth is the cornerstone of Fugaku’s exceptional performance. Many critical scientific simulations, such as climate modeling, molecular dynamics for drug discovery, and genomic analysis, are fundamentally memory-bound. Their performance is limited by the rate at which the CPU can fetch data from memory, not by its floating-point calculation speed. By providing over a terabyte per second of bandwidth directly to the processing cores, the A64FX ensures that its powerful vector processing units are constantly fed with data, minimizing stalls and maximizing computational efficiency.

The results are evident in Fugaku’s benchmark rankings. While it achieved the #1 spot on the traditional TOP500 list (which uses the compute-intensive HPL benchmark), its dominance on memory-bandwidth-sensitive benchmarks is even more telling. Fugaku has consistently held top ranks on benchmarks like HPCG (which models sparse matrix calculations common in simulations) and Graph500 (which measures performance on large-scale data analytics), often outperforming its competitors by a significant margin.13 In direct comparisons on a range of open-source scientific applications, the A64FX architecture has demonstrated performance up to 2.3 times that of contemporary high-end x86 servers, a lead largely attributable to the effective utilization of its high-bandwidth memory.15
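A back-of-the-envelope roofline check, sketched in Python below, shows why this bandwidth matters for memory-bound codes. The 1024 GB/s figure comes from the text; the assumed peak of roughly 3.4 double-precision TFLOP/s per node and the example arithmetic intensities are illustrative values introduced here, not figures from the cited sources.

```python
# Minimal roofline model: attainable FLOP/s = min(peak compute, bandwidth * arithmetic intensity)

PEAK_FLOPS = 3.4e12        # assumed nominal double-precision peak per node, in FLOP/s
PEAK_BANDWIDTH = 1.024e12  # HBM2 bandwidth per node from the text, in bytes/s

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline-limited performance for a kernel performing this many FLOPs per byte moved."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs the node can sustain per byte delivered
print(f"machine balance: {machine_balance:.1f} FLOP/byte")

# Sparse matrix-vector products and stencils typically perform well under 1 FLOP per byte.
for intensity in (0.25, 1.0, 4.0, 16.0):
    perf = attainable_flops(intensity)
    bound = "memory-bound" if perf < PEAK_FLOPS else "compute-bound"
    print(f"intensity {intensity:5.2f} FLOP/B -> {perf / 1e12:5.2f} TFLOP/s ({bound})")
```

With a machine balance of only a few FLOPs per byte, kernels such as the sparse operations at the heart of HPCG sit on the bandwidth roof rather than the compute roof, so every additional GB/s of memory bandwidth translates almost directly into application performance.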
Aurora: Hybrid Memory for Architectural Flexibility
The Aurora supercomputer at Argonne National Laboratory in the United States represents another approach to leveraging HBM, showcasing a hybrid memory architecture designed for flexibility. Aurora’s compute nodes are built around the Intel Xeon CPU Max Series processor, which uniquely integrates both on-package HBM2e and support for traditional off-chip DDR5 memory.16 Each dual-socket node provides 128 GB of high-bandwidth memory (64 GB per CPU) alongside 1,024 GB of larger-capacity DDR5 memory.17 This hybrid design acknowledges that not all applications have the same memory requirements. The Intel Xeon Max CPU can be configured in several modes to best suit the workload; a simple analytic sketch of the resulting trade-off follows the list below:
- Flat Mode: HBM and DDR are presented to the operating system as two distinct NUMA (Non-Uniform Memory Access) nodes. Programmers can explicitly allocate their most performance-critical data structures to the fast HBM, while using the larger DDR pool for less sensitive data. This mode offers maximum control and the highest possible performance for applications with memory footprints that fit within the 64 GB HBM capacity.17
- Cache Mode: The on-package HBM acts as a massive, transparent last-level (L4) cache for the main DDR5 memory. The hardware automatically manages the movement of data between DDR and HBM. This mode is ideal for legacy applications or those with memory footprints larger than 64 GB, as it provides a significant performance uplift by caching hot data in HBM without requiring any code changes.17
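To see why placement matters, the short Python sketch below models the effective bandwidth an application observes as a traffic-weighted harmonic mean of HBM and DDR bandwidth. The two bandwidth constants are illustrative assumptions in the general ballpark of an HBM-equipped socket and an eight-channel DDR5 socket, not Aurora specifications, and the model deliberately ignores latency, caching, and contention effects.

```python
# Effective bandwidth when a fraction of all memory traffic is served from HBM and the rest from DDR.
# Per-byte transfer cost averages over the traffic mix, so bandwidths combine as a harmonic mean.

HBM_BW_GBS = 800.0   # assumed aggregate HBM bandwidth per socket (illustrative)
DDR_BW_GBS = 300.0   # assumed aggregate DDR5 bandwidth per socket (illustrative)

def effective_bandwidth(hbm_traffic_fraction: float) -> float:
    """Blended bandwidth in GB/s, given the fraction of traffic served from HBM."""
    f = hbm_traffic_fraction
    return 1.0 / (f / HBM_BW_GBS + (1.0 - f) / DDR_BW_GBS)

for f in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(f"{f * 100:5.1f}% of traffic from HBM -> {effective_bandwidth(f):6.1f} GB/s effective")
```

Flat mode lets the programmer push the HBM traffic fraction toward 1.0 by binding the hottest data structures to the HBM NUMA nodes, while Cache Mode achieves a similar effect automatically whenever the hot portion of the working set fits within the 64 GB of HBM.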
This architectural flexibility allows Aurora to tackle a diverse range of scientific challenges. For memory-bandwidth-bound codes that can be optimized to fit within the HBM capacity, the Xeon Max Series processors deliver breakthrough performance, showing gains of up to 4.8x compared to competing CPUs on real-world workloads like modeling and data analytics.16 For applications with enormous datasets that exceed the HBM capacity, Cache Mode still provides a substantial benefit, mitigating the latency and bandwidth penalties of accessing the slower DDR memory. The Aurora system, with its combination of HBM-equipped CPUs and GPUs, provides a total of 1.36 PB of CPU HBM and 10.9 PB of DDR5 memory, offering a powerful and versatile platform for exascale science.17

Together, Fugaku and Aurora demonstrate that HBM is not just an incremental improvement but a foundational technology for the exascale era. Whether implemented as the sole memory source for maximum bandwidth or as part of a hybrid system for maximum flexibility, HBM is the key to finally breaking through the memory wall for the world’s most demanding computational problems.
References
Footnotes

1. Micron’s Perspective on Impact of CXL on DRAM Bit Growth Rate, accessed October 2, 2025, https://assets.micron.com/adobe/assets/urn:aaid:aem:b2e25f63-85a2-44c9-b46f-717830deefa5/renditions/original/as/cxl-impact-dram-bit-growth-white-paper.pdf
2. Understanding the “Memory Wall” - Ruturaj Patki, accessed October 2, 2025, https://blog.ruturajpatki.com/understanding-the-memory-wall/
3. Revisiting the Memory Wall - HPCwire, accessed October 2, 2025, https://www.hpcwire.com/2009/02/19/revisiting_the_memory_wall/
4. Synchronous dynamic random-access memory - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory
5. Parallel RAM - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Parallel_RAM
6. What is memory level parallelism? - Quora, accessed October 2, 2025, https://www.quora.com/What-is-memory-level-parallelism
7. Interleaved memory - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Interleaved_memory
8. Memory Interleaving: Parallel Memory Access - Abhik Sarkar, accessed October 2, 2025, https://www.abhik.xyz/concepts/memory/memory-interleaving
9. DDR Generations: Memory Density and Speed - Synopsys Blog, accessed October 2, 2025, https://www.synopsys.com/blogs/chip-design/ddr-generations-memory-density-speed.html
10. The evolution of memory technology, accessed October 2, 2025, https://media.kingston.com/kingston/pdf/ktc-blog-servers-and-data-centers-evolution-memory-technology-ebook-en.pdf
11. High Bandwidth Memory: Concepts, Architecture, and Applications - Wevolver, accessed October 2, 2025, https://www.wevolver.com/article/high-bandwidth-memory
12. High Bandwidth Memory - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/High_Bandwidth_Memory
13. Co-Design and System for the Supercomputer “Fugaku” - ResearchGate, accessed October 2, 2025, https://www.researchgate.net/publication/357233603_Co-design_and_System_for_the_Supercomputer_Fugaku
14. About Fugaku - RIKEN Center for Computational Science, accessed October 2, 2025, https://www.r-ccs.riken.jp/en/fugaku/about/
15. An introduction to Fugaku - HPC User Forum, accessed October 2, 2025, https://www.hpcuserforum.com/wp-content/uploads/2021/05/Shoji_RIKEN_Introduction-to-Fugaku_Mar2022-HPC-UF.pdf
16. Intel Xeon CPU Max Series - AI, Deep Learning, and HPC Processors, accessed October 2, 2025, https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html
17. Performance Analysis of HPC Applications on the Aurora Supercomputer: Exploring the Impact of HBM-Enabled Intel Xeon Max CPUs - arXiv, accessed October 2, 2025, https://arxiv.org/html/2504.03632v1