6.3 Modern GPU Architecture

The unified architecture established the design principle for modern GPUs, but implementations have grown more complex. A modern GPU is a hierarchical system, not a flat array of cores. The individual processors, called CUDA Cores by NVIDIA and Stream Processors by AMD, are grouped into execution blocks known as Streaming Multiprocessors (SMs) and Compute Units (CUs), respectively.¹ SMs and CUs are the primary computational units of the GPU; each contains instruction schedulers, register files, execution units, and local memory or caches.² These blocks are replicated across the die and often grouped into larger structures, such as NVIDIA’s Graphics Processing Clusters (GPCs).³ The hierarchy keeps complexity manageable and lets performance scale: a vendor can build smaller or larger chips from the same block by changing how many copies it places on the die. Over the last decade, the SM and CU have evolved differently, reflecting the distinct strategic priorities of NVIDIA and AMD.
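The scaling effect of this hierarchy can be sketched with simple arithmetic. The Python snippet below is a minimal illustration, not a vendor formula: the `GpuLayout` class and its field names are hypothetical, and the figures (7 GPCs, 12 SMs per GPC, 128 FP32 cores per SM) are illustrative inputs chosen to resemble a large Ampere-class die.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    """Hypothetical description of a GPU as GPCs -> SMs -> cores."""
    gpcs: int          # Graphics Processing Clusters on the die
    sms_per_gpc: int   # Streaming Multiprocessors in each GPC
    cores_per_sm: int  # FP32 cores in each SM

    @property
    def total_sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def total_cores(self) -> int:
        return self.total_sms * self.cores_per_sm


# Illustrative figures only; real products often ship with some SMs disabled.
layout = GpuLayout(gpcs=7, sms_per_gpc=12, cores_per_sm=128)
print(layout.total_sms, layout.total_cores)
```

Changing only `gpcs` or `sms_per_gpc` yields a smaller or larger chip built from the same SM design, which is the practical benefit of replicating one block.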

The NVIDIA Approach: The Streaming Multiprocessor (SM)

NVIDIA’s SM has evolved to maximize throughput and incorporate specialized hardware for AI and HPC workloads.

Pascal (2016): The GP100 SM featured 64 FP32 CUDA cores, fewer than the 128 in the prior Maxwell architecture, but partitioned them into two blocks, each with its own warp scheduler. This allowed for finer-grained resource allocation and improved instruction-level parallelism, increasing overall efficiency.⁴
Volta (2017): The Volta SM introduced a significant architectural change with the Tensor Core, a specialized unit for the mixed-precision 4x4 matrix multiply-accumulate operations used in deep learning.⁵ A single Volta SM with eight Tensor Cores provided a substantial increase in deep learning training performance over Pascal.⁶ Volta also added dedicated INT32 cores, allowing concurrent execution of integer and floating-point instructions.⁵
Ampere (2020): The Ampere SM continued this strategy with third-generation Tensor Cores supporting new numerical formats (e.g., TF32) and hardware-accelerated sparsity.⁷ To address data bottlenecks, Ampere increased the L1 cache/shared memory capacity to 192 KB per SM and introduced asynchronous copy instructions to move data from global to shared memory more efficiently.⁸
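The operation a Tensor Core performs, D = A x B + C on 4x4 matrices, can be sketched in plain Python. This is a numerical illustration under stated assumptions, not a model of the hardware: the helpers `to_fp16`, `to_fp32`, and `mma_4x4` are hypothetical names, the inputs are assumed to be FP16 with FP32 accumulation, and rounding after every addition is a simplification that real hardware need not follow.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 lists of lists.

    A and B are rounded to FP16 before multiplying; the running sum is
    kept in FP32 (assumed rounding after each add, for illustration).
    """
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_fp32(c[i][j])
            for k in range(4):
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
print(mma_4x4(identity, b, c))  # each entry is b[i][j] + 1
```

Keeping the accumulator in a wider format than the inputs is what makes the operation "mixed-precision": the narrow inputs save bandwidth and storage, while the wider sum limits accumulated rounding error.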

The AMD Approach: The Compute Unit (CU)

AMD’s architecture shifted from a general compute-centric design to one optimized for real-time gaming.

Graphics Core Next (GCN) (2012-2019): The GCN CU was a capable compute engine with four 16-wide SIMD vector units and a 64 KB Local Data Share (LDS).² It used a 64-thread “Wavefront” (Wave64) execution model. Because each 16-wide SIMD needed four cycles to work through the 64 threads of a wavefront, a given wavefront could issue a new instruction only once every four cycles. This suited throughput-oriented GPGPU tasks, but it could be inefficient for the divergent shaders in modern games.⁹
Radeon DNA (RDNA) (2019-Present): RDNA was a redesign focused on gaming performance. It introduced the Workgroup Processor (WGP), which contains two CUs.¹⁰ The execution model changed to a 32-thread wavefront (Wave32) to reduce latency and improve efficiency with divergent branches.⁹ RDNA also redesigned the cache hierarchy and widened the SIMD units to 32 lanes so that a Wave32 instruction can issue in a single cycle, a significant latency reduction from GCN.⁹
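The two effects described for GCN and RDNA, issue latency and sensitivity to divergence, can be sketched with a toy model. The functions `issue_cycles` and `divergent_waves` are hypothetical helpers written for this illustration; they ignore scheduling, caches, and everything else a real GPU does, and the 128-thread branch pattern is an invented example.

```python
import math


def issue_cycles(wave_size, simd_width):
    """Cycles a SIMD needs to pass one wavefront instruction through its lanes."""
    return math.ceil(wave_size / simd_width)


def divergent_waves(takes_branch, wave_size):
    """Count wavefronts whose threads disagree on a branch.

    A divergent wavefront must execute both sides of the branch, so
    fewer divergent wavefronts means less wasted work.
    """
    count = 0
    for start in range(0, len(takes_branch), wave_size):
        wave = takes_branch[start:start + wave_size]
        if any(wave) and not all(wave):
            count += 1
    return count


# Invented pattern: only the first 32 of 128 threads take the branch.
takes_branch = [i < 32 for i in range(128)]

print(issue_cycles(64, 16), issue_cycles(32, 32))  # GCN-style vs RDNA-style
print(divergent_waves(takes_branch, 64), divergent_waves(takes_branch, 32))
```

With this pattern the Wave64 grouping produces one divergent wavefront while the Wave32 grouping produces none, because the branch boundary happens to fall on a 32-thread edge. Real shaders are less tidy, but narrower wavefronts still tend to confine divergence to fewer threads.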

The SM and CU architectures reflect NVIDIA’s and AMD’s differing market strategies. NVIDIA’s investment in specialized hardware like Tensor Cores is driven by its position in the data center and AI markets.⁵ AMD’s role as the GPU provider for major consoles incentivizes optimization for gaming, and RDNA’s focus on latency reduction and shader efficiency follows directly from that role.⁹

While core counts and raw FLOPS are common marketing metrics, the primary architectural challenge has become the memory bottleneck. As the number of cores increases, the ability to supply them with data is the main constraint on performance.¹¹

Features like NVIDIA’s asynchronous copy instructions and AMD’s multi-level cache hierarchy are designed to improve data logistics, not just computation. The effectiveness of a modern GPU’s memory subsystem is an increasingly critical factor in its overall performance.
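One common way to reason about this bottleneck is the roofline model, which the text above does not name but which captures the same idea: attainable throughput is capped either by peak compute or by memory bandwidth multiplied by the work done per byte moved. The sketch below uses hypothetical round numbers (20 TFLOP/s peak, 1 TB/s bandwidth) that do not describe any specific GPU, and `attainable_flops` is an illustrative helper.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline bound: the lower of the compute and memory ceilings."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


# Hypothetical device figures, chosen only to make the arithmetic easy.
PEAK = 20e12       # FLOP/s
BANDWIDTH = 1e12   # bytes/s
ridge = PEAK / BANDWIDTH  # intensity at which the two ceilings meet

for intensity in (0.25, 4.0, 20.0, 64.0):
    achieved = attainable_flops(PEAK, BANDWIDTH, intensity)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>5} FLOP/byte -> {achieved / 1e12:.2f} TFLOP/s ({bound})")
```

Under these assumed figures, a kernel that performs fewer than 20 floating-point operations per byte fetched cannot reach peak throughput no matter how many cores are available, which is why adding cores without improving data delivery yields diminishing returns.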

The following table provides a high-level comparison of the two modern architectural philosophies.

Feature	NVIDIA Ampere SM (GA10x)	AMD RDNA 2 WGP (Dual CU)
Core Execution Unit	128 FP32 CUDA cores (64 dedicated FP32 plus 64 shared FP32/INT32); the data-center GA100 SM has 64.	2 CUs, each with two 32-wide SIMD units.
Thread Grouping	Warp (32 threads).	Wavefront (native Wave32, supports Wave64).
Key Specialization	4x third-generation Tensor Cores (AI/matrix math), 1x RT Core.	1x Ray Accelerator per CU.
Local Memory/Cache	128 KB unified L1 data cache / shared memory (192 KB on GA100).	128 KB LDS per WGP, 16 KB L0 vector cache per CU, 128 KB L1 shared per shader array.
Architectural Focus	Optimized for AI/HPC workloads and graphics.	Optimized for gaming latency and efficiency.

References

1. History and Evolution of GPU Architecture, accessed October 3, 2025, https://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf
2. Graphics Core Next - Wikipedia, accessed October 3, 2025, https://en.wikipedia.org/wiki/Graphics_Core_Next
3. Understanding Modern GPU Architecture - Mohit Mishra, accessed October 3, 2025, https://mohitmishra786.github.io/chessman/2024/11/24/Understanding-Modern-GPU-Architecture.html
4. NVIDIA Pascal GP100 Architecture Deep-Dive | GamersNexus, accessed October 3, 2025, https://gamersnexus.net/guides/2423-nvidia-pascal-gp100-architecture-deep-dive-specs
5. Volta Tuning Guide - NVIDIA Docs, accessed October 3, 2025, https://docs.nvidia.com/cuda/pdf/Volta_Tuning_Guide.pdf
6. NVIDIA Tesla V100 GPU Architecture, accessed October 3, 2025, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
7. NVIDIA A100 Tensor Core GPU Architecture, accessed October 3, 2025, https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
8. NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog, accessed October 3, 2025, https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
9. [AMD] RDNA Whitepaper : r/Amd - Reddit, accessed October 3, 2025, https://www.reddit.com/r/Amd/comments/ctfbem/amd_rdna_whitepaper/
10. RDNA Architecture - AMD GPUOpen, accessed October 3, 2025, https://gpuopen.com/download/RDNA_Architecture_public.pdf
11. Evolution of Graphics Pipelines, mcs572 0.7.8 documentation, accessed October 3, 2025, http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/lec28.html
