5.1 Main Memory Parallelism

The evolution of parallelism within main memory is a direct response to the persistent processor-memory performance gap, or “memory wall,” a foundational challenge discussed in earlier chapters. This section provides a technical analysis of the architectural and technological solutions developed to mitigate this bottleneck.

The challenge is rooted in two primary metrics of memory performance:

  • Latency: The time required to access a specific memory location (measured in nanoseconds). It is governed by the physical characteristics of DRAM cells and has proven difficult to improve significantly.1
  • Bandwidth: The rate at which data can be transferred after the initial access (measured in GB/s). It is a function of the memory bus width and clock frequency.
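Peak bandwidth follows directly from these two parameters. A back-of-the-envelope sketch (the 64-bit bus and 3200 MT/s transfer rate below are illustrative values, not tied to any particular part):

```python
# Peak bandwidth = bus width (bytes) x transfers per second.
# Illustrative values: a 64-bit (8-byte) bus running at 3200 MT/s.
bus_width_bytes = 64 // 8            # 8 bytes moved per transfer
transfers_per_s = 3200 * 10**6       # 3200 mega-transfers per second

peak_bw = bus_width_bytes * transfers_per_s
print(f"Peak bandwidth: {peak_bw / 1e9:.1f} GB/s")   # 25.6 GB/s

# Latency is a separate, largely fixed cost: a single dependent
# 64-byte cache-line fetch still waits tens of nanoseconds for the
# first data to arrive, no matter how wide or fast the bus is.
```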

While latency has remained stubbornly resistant to improvement, decades of architectural innovation have focused on dramatically increasing memory bandwidth. This analysis covers the foundational concepts that introduced parallelism into the memory subsystem, traces the generational refinements of Double Data Rate (DDR) memory, and culminates in an examination of High Bandwidth Memory (HBM), the 3D-stacked paradigm powering modern high-performance computing systems.

5.1.2 Foundational Concepts: From PRAM to Physical Architectures

The theoretical basis for parallel memory access is the Parallel Random-Access Machine (PRAM) model. It provides an idealized framework for designing parallel algorithms, assuming multiple processors can access any memory location in a single time step.2

The PRAM model is refined into several variants based on how it handles simultaneous memory accesses:

  • Exclusive Read, Exclusive Write (EREW): At most one processor may read or write any given memory cell in a single step.
  • Concurrent Read, Exclusive Write (CREW): Multiple processors may read the same cell simultaneously, but only one may write to it in a given step.
  • Concurrent Read, Concurrent Write (CRCW): Multiple processors may read and write the same cell simultaneously; a conflict-resolution rule (e.g., common, arbitrary, or priority) determines which value is stored.2
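These access rules can be made concrete with a toy conflict checker; this is purely illustrative and says nothing about how real hardware arbitrates accesses:

```python
from collections import Counter

def step_is_legal(reads, writes, model):
    """Check whether one synchronous PRAM step is legal under a variant.
    `reads` and `writes` list the memory-cell indices touched by the
    processors in that step."""
    read_counts, write_counts = Counter(reads), Counter(writes)
    if model == "EREW":   # every cell read or written by at most one processor
        return max(read_counts.values(), default=0) <= 1 and \
               max(write_counts.values(), default=0) <= 1
    if model == "CREW":   # concurrent reads allowed, writes must be exclusive
        return max(write_counts.values(), default=0) <= 1
    if model == "CRCW":   # everything allowed; a conflict-resolution rule
        return True       # (common/arbitrary/priority) picks the stored value
    raise ValueError(f"unknown model: {model}")

# Two processors read cell 5 while a third writes cell 7:
print(step_is_legal([5, 5], [7], "EREW"))  # False: concurrent read
print(step_is_legal([5, 5], [7], "CREW"))  # True
```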

Practical hardware implementations rely on techniques to hide the latency the PRAM model ignores. Two key techniques are:

  1. Memory-Level Parallelism (MLP): A microarchitectural strategy where an out-of-order execution engine services multiple cache misses in parallel, overlapping the latency periods of several memory requests to reduce the effective access time.3
  2. Memory Interleaving: A design that spreads contiguous memory addresses across multiple physical memory banks. By directing sequential memory accesses to different banks, the memory controller can pipeline operations, allowing multiple memory accesses to be in different stages of completion simultaneously. This hides individual bank latency and maximizes overall throughput.4
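A minimal sketch of low-order interleaving, assuming a power-of-two number of banks and word-granularity striping (real controllers use many more banks and richer address hashes):

```python
NUM_BANKS = 4      # illustrative; real DIMMs expose many banks and bank groups
WORD_BYTES = 8     # interleave granularity of one 64-bit word (illustrative)

def map_address(addr):
    """Map a physical byte address to (bank, offset) under low-order
    word interleaving: consecutive words rotate across banks."""
    word = addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS

# Sequential accesses land on different banks, so the controller can
# pipeline them instead of waiting for one bank to finish each access.
for addr in range(0, 64, WORD_BYTES):
    bank, offset = map_address(addr)
    print(f"addr {addr:3d} -> bank {bank}, offset {offset}")
```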

5.1.3 The DDR SDRAM Epoch: Scaling Bandwidth Through Generational Refinements

For over two decades, Double Data Rate Synchronous Dynamic Random-Access Memory (DDR SDRAM) was the dominant memory technology. Its evolution is a case study in incremental engineering that delivered exponential bandwidth gains.

The key innovation of DDR was transferring data on both the rising and falling edges of the clock signal, effectively doubling the data transfer rate without increasing the clock frequency.5 This was enabled by a prefetch buffer that fetches a block of data from the slower internal memory array and serializes it onto the faster external bus. The size of this buffer was the primary lever for generational improvement:

  • DDR SDRAM: 2n-prefetch
  • DDR2 SDRAM: 4n-prefetch
  • DDR3 SDRAM: 8n-prefetch
  • DDR4 SDRAM: 8n-prefetch (with bank groups for improved efficiency)
  • DDR5 SDRAM: 16n-prefetch (effectively, via two independent channels)5

This progression is summarized below.

Table 5.1: Generational Comparison of DDR SDRAM Standards

| Generation | Year Introduced | Data Rate (MT/s) | I/O Clock (MHz) | Prefetch Buffer | Voltage |
| --- | --- | --- | --- | --- | --- |
| DDR (DDR1) | ~2000 | 200–400 | 100–200 | 2n | 2.5 V |
| DDR2 | ~2003 | 400–1066 | 200–533 | 4n | 1.8 V |
| DDR3 | ~2007 | 800–2133 | 400–1066 | 8n | 1.5 V |
| DDR4 | ~2014 | 2133–3200+ | 1066–1600 | 8n | 1.2 V |
| DDR5 | ~2020 | 4800–8400+ | 2400–4200+ | 16n (8n ×2) | 1.1 V |

Data sourced from Synopsys.5
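Table 5.1 can be sanity-checked with simple arithmetic: the external data rate is roughly the internal array (core) clock multiplied by the prefetch depth, and per-module bandwidth is the data rate times the 8-byte bus. The 200 MHz core clock below is an illustrative assumption; DDR4 and DDR5 also raised internal clocks and added bank groups, which this constant-clock model deliberately ignores:

```python
CORE_CLOCK_MHZ = 200   # illustrative internal DRAM array clock, held constant

# Prefetch depth per generation, taken from Table 5.1.
prefetch = {"DDR": 2, "DDR2": 4, "DDR3": 8, "DDR4": 8, "DDR5": 16}

for gen, n in prefetch.items():
    data_rate_mts = CORE_CLOCK_MHZ * n          # external transfers per second
    module_bw_gbs = data_rate_mts * 8 / 1000    # 64-bit (8-byte) module bus
    print(f"{gen:5s}: ~{data_rate_mts:4d} MT/s, ~{module_bw_gbs:5.1f} GB/s per module")

# The output shows prefetch alone explains the DDR -> DDR3 scaling;
# DDR4 and DDR5 reached their higher rates by also raising clocks and
# reorganizing the device into bank groups or independent channels.
```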

5.1.4 The 3D Revolution: High Bandwidth Memory (HBM)

The power and signal integrity limitations of the planar DDR architecture necessitated a paradigm shift to a three-dimensional design: High Bandwidth Memory (HBM).6 HBM achieves superior performance through a communication interface that is extremely wide, clocked more slowly, and physically much shorter.

This is made possible by three core architectural principles:

  1. Stacked DRAM and Through-Silicon Vias (TSVs): Multiple DRAM dies are stacked vertically and interconnected by thousands of microscopic vertical channels (TSVs), enabling unprecedented density.6
  2. Ultra-Wide Memory Interface: The 3D stacking allows for an extraordinarily wide memory bus (e.g., 1024 bits for an HBM2 stack), compared to the 64-bit interface of a standard DDR5 DIMM.7
  3. Silicon Interposer and 2.5D Packaging: The processor and HBM stacks are placed on a silicon interposer, which contains high-density wiring connecting them over a very short distance. This minimizes latency and power consumption.6

HBM’s design inverts the traditional approach: by making the data path massively wider and shorter, it can operate at a lower frequency, resulting in significantly higher energy efficiency (performance-per-watt).
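A quick illustration of this width-versus-frequency trade-off, using typical per-pin rates (6.4 Gb/s for DDR5-6400 and 3.6 Gb/s for HBM2E; exact figures vary by part):

```python
def peak_gbs(bus_bits, per_pin_gbps):
    """Peak bandwidth in GB/s: bus width in bits times per-pin rate,
    divided by 8 to convert bits to bytes."""
    return bus_bits * per_pin_gbps / 8

ddr5_module = peak_gbs(64, 6.4)     # ~51.2 GB/s for one DDR5-6400 module
hbm2e_stack = peak_gbs(1024, 3.6)   # ~460.8 GB/s for one HBM2E stack

print(f"DDR5 module: {ddr5_module:.1f} GB/s")
print(f"HBM2E stack: {hbm2e_stack:.1f} GB/s")
# HBM's pins run at roughly half the DDR5 rate, yet the 16x-wider bus
# delivers roughly 9x the bandwidth, at lower energy per bit moved.
```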

Table 5.2: Comparison of DDR5 and HBM2E Memory Technologies

| Feature | DDR5 (per module) | HBM2E (per stack) | Advantage |
| --- | --- | --- | --- |
| Bus Width | 64-bit | 1024-bit | HBM (16× wider) |
| Peak Bandwidth | ~51.2 GB/s (DDR5-6400) | 460 GB/s | HBM (~9× higher) |
| Architecture | Planar (2D) | Stacked (3D) | HBM (denser, shorter path) |
| Power Efficiency | Lower | Higher | HBM |
| Primary Use Case | General-purpose computing | HPC, AI, high-end GPUs | Specialized |

Data sourced from multiple industry reports.8 9

5.1.5 Case Study: HBM in Exascale Supercomputing

The impact of HBM is most evident in exascale supercomputing, where performance is often limited by data movement.

Fugaku: Maximizing Bandwidth for Scientific Discovery

The Fugaku supercomputer’s Fujitsu A64FX processor integrates 32 GiB of HBM2 memory per CPU, delivering 1024 GB/s of memory bandwidth per node.10 This massive bandwidth is critical for memory-bound scientific simulations, such as:

  • Climate modeling
  • Molecular dynamics
  • Genomic analysis

By ensuring its powerful vector processing units are constantly fed with data, the A64FX architecture has demonstrated performance up to 2.3 times that of contemporary high-end x86 servers on a range of scientific applications, a lead largely attributed to its high-bandwidth memory.11
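The per-node bandwidth figure quoted above decomposes neatly, assuming the commonly described A64FX configuration of four on-package HBM2 stacks (one per core-memory group), each 8 GiB and 256 GB/s:

```python
# Assumed A64FX memory configuration (one HBM2 stack per core-memory group).
stacks = 4
gib_per_stack = 8       # 4 x 8 GiB    = 32 GiB per CPU
gbs_per_stack = 256     # 4 x 256 GB/s = 1024 GB/s per node

print(f"Capacity:  {stacks * gib_per_stack} GiB")
print(f"Bandwidth: {stacks * gbs_per_stack} GB/s")
```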

Aurora: Hybrid Memory for Architectural Flexibility

The Aurora supercomputer’s Intel Xeon CPU Max Series processor features a hybrid memory architecture, integrating both on-package HBM2e and off-chip DDR5 memory.12 This allows the system to be configured in different modes:

  • Flat Mode: HBM and DDR are presented as distinct memory regions, allowing programmers to explicitly allocate critical data to the fast HBM (see the sketch after this list).
  • Cache Mode: The HBM acts as a large, transparent last-level cache for the main DDR5 memory, providing a performance uplift for legacy applications without code changes.13
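In flat mode, the HBM typically appears as additional CPU-less NUMA nodes that applications can then target explicitly (for example via numactl or libnuma). A minimal Linux-only sketch that lists memory-only nodes; the sysfs layout is standard, but treating every CPU-less node as HBM is a simplifying assumption:

```python
import glob
import os

# List NUMA nodes and flag memory-only ones (nodes with RAM but no CPUs),
# which is how flat-mode HBM is typically exposed on such systems.
for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_dir)
    with open(os.path.join(node_dir, "cpulist")) as f:
        cpulist = f.read().strip()
    with open(os.path.join(node_dir, "meminfo")) as f:
        mem_kb = int(f.read().split("MemTotal:")[1].split("kB")[0])
    kind = "memory-only (candidate HBM)" if not cpulist else "CPU + memory"
    print(f"{node}: {mem_kb // (1024 * 1024)} GiB, cpus=[{cpulist}] -> {kind}")
```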

This flexibility allows Aurora to deliver breakthrough performance on a diverse range of scientific workloads, achieving gains of up to 4.8x on memory-bandwidth-bound codes compared to competing CPUs.12

Together, Fugaku and Aurora demonstrate that HBM is a foundational technology for the exascale era, providing the necessary bandwidth to overcome the memory wall for the world’s most demanding computational problems.

  1. Synchronous dynamic random-access memory - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory

  2. Parallel RAM - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Parallel_RAM

  3. What is memory level parallelism? - Quora, accessed October 2, 2025, https://www.quora.com/What-is-memory-level-parallelism

  4. Interleaved memory - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/Interleaved_memory

  5. DDR Generations: Memory Density and Speed | Synopsys Blog, accessed October 2, 2025, https://www.synopsys.com/blogs/chip-design/ddr-generations-memory-density-speed.html

  6. High Bandwidth Memory: Concepts, Architecture, and Applications - Wevolver, accessed October 2, 2025, https://www.wevolver.com/article/high-bandwidth-memory

  7. High Bandwidth Memory - Wikipedia, accessed October 2, 2025, https://en.wikipedia.org/wiki/High_Bandwidth_Memory

  8. Micron’s Perspective on Impact of CXL on DRAM Bit Growth Rate, accessed October 2, 2025, https://assets.micron.com/adobe/assets/urn:aaid:aem:b2e25f63-85a2-44c9-b46f-717830deefa5/renditions/original/as/cxl-impact-dram-bit-growth-white-paper.pdf

  9. Understanding the “Memory Wall” - Ruturaj Patki, accessed October 2, 2025, https://blog.ruturajpatki.com/understanding-the-memory-wall/

  10. About Fugaku | RIKEN Center for Computational Science, accessed October 2, 2025, https://www.r-ccs.riken.jp/en/fugaku/about/

  11. An introduction to Fugaku - HPC User Forum, accessed October 2, 2025, https://www.hpcuserforum.com/wp-content/uploads/2021/05/Shoji_RIKEN_Introduction-to-Fugaku_Mar2022-HPC-UF.pdf

  12. Intel® Xeon® CPU Max Series - AI, Deep Learning, and HPC Processors, accessed October 2, 2025, https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html

  13. Performance Analysis of HPC applications on the Aurora Supercomputer: Exploring the Impact of HBM-Enabled Intel Xeon Max CPUs - arXiv, accessed October 2, 2025, https://arxiv.org/html/2504.03632v1