10.3 The Dominant Trend: Domain-Specific Architectures (DSAs)

The End of the Golden Age: Why General-Purpose Is No Longer Enough

The most immediate and impactful trend shaping the future of parallelism is the dramatic industry-wide pivot away from general-purpose processors (GPPs) and toward Domain-Specific Architectures (DSAs). A DSA is a microprocessor tailored for a specific application domain, such as artificial intelligence, networking, or computer vision. It achieves significant gains in performance and efficiency by making architectural trade-offs that favor a narrow set of tasks over the broad flexibility of a GPP.1 This shift is not a matter of choice but a necessary response to the closure of the “golden age” of free performance, a period brought to an end by the collision with four fundamental “walls”.1

  1. The End of Free Scaling: For decades, the combined effects of Moore’s Law (more transistors) and Dennard scaling (constant power density) provided a virtuous cycle of exponential performance growth.1 Software developers could rely on the hardware to get faster with each generation. With the breakdown of Dennard scaling around 2005 due to leakage currents and the subsequent slowing of Moore’s Law, this free ride came to an end. Performance gains now have to be explicitly designed and paid for in terms of complexity and power.2
  2. The Power Wall: Energy has become the primary constraint in modern computing system design.1 In large data centers and even in mobile devices, simply moving data from main memory to the processor core can consume more energy than the computation performed on that data. GPPs, with their complex control logic and speculative execution engines, are inherently less energy-efficient than streamlined, specialized hardware.1
  3. The Memory Wall: The performance gap between fast processors and slower main memory continues to widen. As a result, GPPs spend an increasing percentage of their time stalled, waiting for data to arrive.1 This latency bottleneck limits the effective utilization of the processor’s computational resources.
  4. The Parallelism Wall: While modern workloads, especially in AI and data analytics, are massively parallel, GPPs with a few large, complex cores are not the most efficient way to exploit this. They are designed for low latency on a single instruction stream, whereas these workloads benefit from high throughput across thousands of parallel data streams.1

DSAs address these four walls directly. By specializing, they eliminate the complex, power-hungry logic required for general-purpose computation. They are co-designed with a specific memory hierarchy to minimize data movement, often featuring large on-chip memories. And they are built from the ground up with massive parallelism, using thousands of simple, efficient compute elements instead of a handful of heavyweight cores.1
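
The power-wall argument about data movement is easy to make concrete with a back-of-envelope estimate. The sketch below is not taken from the cited sources; the per-operation energy figures are illustrative assumptions, roughly the order of magnitude reported in Mark Horowitz’s ISSCC 2014 keynote for a ~45 nm process, and the helper name energy_per_multiply is hypothetical.

```python
# Back-of-envelope comparison: energy of arithmetic vs. energy of data movement.
# The picojoule figures are illustrative assumptions (order-of-magnitude values
# in the spirit of Horowitz, ISSCC 2014, ~45 nm); real numbers vary widely by
# process node and memory technology.

PJ_FP32_MULTIPLY = 4       # one 32-bit floating-point multiply
PJ_SRAM_READ_32B = 20      # read 32 bits from a large on-chip SRAM
PJ_DRAM_READ_32B = 600     # read 32 bits from off-chip DRAM

def energy_per_multiply(operand_read_pj: float) -> float:
    """Energy (pJ) for one multiply whose two operands come from a given memory."""
    return PJ_FP32_MULTIPLY + 2 * operand_read_pj

from_dram = energy_per_multiply(PJ_DRAM_READ_32B)
from_sram = energy_per_multiply(PJ_SRAM_READ_32B)

print(f"multiply fed from DRAM: {from_dram:6.0f} pJ")
print(f"multiply fed from SRAM: {from_sram:6.0f} pJ")
print(f"DRAM-fed data movement exceeds the arithmetic by ~{from_dram / PJ_FP32_MULTIPLY:.0f}x")
```

Under these assumptions, fetching operands from DRAM costs hundreds of times more energy than the multiply itself, which is exactly why DSAs invest so heavily in large on-chip memories and carefully staged data movement.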

The market’s response to the limitations of GPPs has been a fervent and rapid diversification of hardware, a phenomenon aptly dubbed the “Cambrian Explosion” of AI hardware.3 Just as the biological Cambrian Explosion saw a rapid emergence of diverse life forms, the last decade has witnessed a proliferation of hardware startups and major R&D projects at established companies, all aiming to build custom silicon for specific, high-value workloads.4 The primary catalyst for this explosion is the insatiable computational demand of deep learning.5 Neural network training and inference are characterized by massive matrix multiplications and a high tolerance for low-precision arithmetic, making them a perfect target for specialized hardware that can outperform GPPs by orders of magnitude in both speed and efficiency.1 This dynamic landscape is populated by a host of innovative players. A wave of well-funded startups—including Graphcore, SambaNova Systems, Tenstorrent, and Groq—are challenging the status quo with novel architectures.6 Simultaneously, the hyperscale cloud providers and established semiconductor giants have invested billions in their own DSA projects to optimize their infrastructure and products. This includes Google’s Tensor Processing Unit (TPU), Amazon’s Inferentia and Trainium chips, and dedicated AI accelerator projects at AMD, Intel, and IBM, all competing with the dominant incumbent in AI acceleration, NVIDIA.6
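
The tolerance for low-precision arithmetic mentioned above is worth seeing in miniature. The following sketch is illustrative only and not drawn from the cited sources: it quantizes random float32 matrices to 8-bit integers with a simple symmetric per-tensor scheme (the function name quantize_symmetric and the matrix shapes are arbitrary choices), performs the matrix multiply in integer arithmetic, and measures how little accuracy is lost.

```python
import numpy as np

# Quantize float32 operands to int8, multiply in integer arithmetic with an
# int32 accumulator (as INT8 accelerators do), then rescale and compare
# against the full-precision result.
rng = np.random.default_rng(0)
activations = rng.standard_normal((64, 256)).astype(np.float32)
weights = rng.standard_normal((256, 128)).astype(np.float32)

def quantize_symmetric(x: np.ndarray):
    """Map float values onto int8 in [-127, 127] using one scale per tensor."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

qa, scale_a = quantize_symmetric(activations)
qw, scale_w = quantize_symmetric(weights)

int_product = qa.astype(np.int32) @ qw.astype(np.int32)   # exact integer MACs
approx = int_product.astype(np.float32) * (scale_a * scale_w)
exact = activations @ weights

rel_error = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error of the INT8 matmul vs. FP32: {rel_error:.2%}")
```

For neural-network style workloads, an error of this size is typically invisible in final model accuracy, while the INT8 datapath is far cheaper in silicon area and energy than FP32, which is precisely the opening that specialized hardware exploits.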

Case Study: The AI Accelerator Duel - Google’s TPU vs. NVIDIA’s GPU

The competition between NVIDIA’s GPUs and Google’s TPUs provides a concrete illustration of the fundamental trade-off at the heart of the DSA trend: general-purpose flexibility versus domain-specific efficiency.

  • NVIDIA’s GPU (General-Purpose Parallelism): The Graphics Processing Unit has undergone a remarkable evolution. Originally a fixed-function pipeline for rendering 3D graphics, the GPU became a fully programmable parallel processor with the advent of programmable shaders and, most critically, the introduction of compute frameworks like CUDA in 2006 and the open standard OpenCL.7 This transition to General-Purpose computing on GPUs (GPGPU) unlocked the massive parallelism of the GPU’s architecture—thousands of simple cores organized into Streaming Multiprocessors (SMs)—for a wide range of scientific and data-parallel tasks.8 Its strength lies in its programmability and its high performance on 32-bit and 64-bit floating-point arithmetic, making it a versatile workhorse for both high-performance computing (HPC) and AI.7
  • Google’s TPU (Specialized Acceleration): The Tensor Processing Unit, in contrast, is a purpose-built Application-Specific Integrated Circuit (ASIC) designed from the ground up for one primary task: accelerating the tensor operations at the heart of neural networks.9 The architectural centerpiece of the TPU is a systolic array, a large two-dimensional grid of multiply-accumulate (MAC) units that can perform massive matrix multiplications with extreme efficiency.9 The TPU eschews the complex control logic and high-precision floating-point units of a GPU in favor of a design optimized for high-volume, low-precision computation (e.g., 8-bit integers and the 16-bit bfloat16 format).9 This specialization makes it less flexible than a GPU but allows it to achieve superior performance and power efficiency on its target workload.10
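
To make the systolic-array idea concrete, here is a small, output-stationary simulation of the dataflow: operands of one matrix stream rightward, operands of the other stream downward, and the skewed injection times guarantee that each multiply-accumulate (MAC) cell sees exactly the operand pairs it needs for one output element. This is a pedagogical model under simplified assumptions, not a description of the TPU’s actual microarchitecture, and the function name systolic_matmul is invented here.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Simulate an M x N grid of MAC cells computing C = A @ B (output-stationary)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"

    acc = np.zeros((M, N))        # one accumulator per processing element (PE)
    a_reg = np.zeros((M, N))      # A values currently flowing through the array
    b_reg = np.zeros((M, N))      # B values currently flowing through the array
    total_cycles = K + M + N - 2  # cycles until the last PE sees its final pair

    for t in range(total_cycles):
        # All PEs update simultaneously from the previous cycle's registers.
        new_a = np.zeros_like(a_reg)
        new_b = np.zeros_like(b_reg)
        for i in range(M):
            for j in range(N):
                if j == 0:                        # left edge: inject skewed row of A
                    k = t - i
                    new_a[i, j] = A[i, k] if 0 <= k < K else 0.0
                else:                             # otherwise shift A rightward
                    new_a[i, j] = a_reg[i, j - 1]
                if i == 0:                        # top edge: inject skewed column of B
                    k = t - j
                    new_b[i, j] = B[k, j] if 0 <= k < K else 0.0
                else:                             # otherwise shift B downward
                    new_b[i, j] = b_reg[i - 1, j]
        a_reg, b_reg = new_a, new_b
        acc += a_reg * b_reg                      # every PE performs one MAC per cycle
    return acc

A = np.arange(12, dtype=float).reshape(3, 4)
B = np.arange(8, dtype=float).reshape(4, 2)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The point of the exercise is the control structure: once the operands are skewed correctly, every cell does nothing but multiply, add, and pass values to its neighbors, which is why a real systolic array needs none of the instruction fetch, caching, or speculation machinery of a general-purpose core.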

The following table provides a direct architectural comparison, crystallizing the core trade-offs between these two leading approaches to AI acceleration.

| Feature | NVIDIA GPU (e.g., A100) | Google TPU (e.g., v4) |
| --- | --- | --- |
| Design Philosophy | General-purpose parallel processor | Application-Specific Integrated Circuit (ASIC) |
| Primary Workload | Graphics, HPC, AI (training and inference) | AI (training and inference), specifically neural networks |
| Core Architecture | Thousands of general-purpose CUDA cores in Streaming Multiprocessors (SMs) | Large systolic array for matrix multiplication (MXU) |
| Precision Focus | High-performance FP64, FP32, TF32, FP16, INT8 | Optimized for low precision: bfloat16, INT8 |
| Programmability | High (via CUDA, OpenCL); flexible for diverse algorithms | Lower; optimized for TensorFlow/JAX/PyTorch tensor operations |
| On-Chip Memory | Large caches and High Bandwidth Memory (HBM) | Very large on-chip buffers plus High Bandwidth Memory (HBM) to feed the MXU |
| Key Advantage | Flexibility and programmability for a wide range of parallel tasks | Extreme performance and power efficiency on dense matrix multiplication |
Data sourced from 7

Democratizing Design: The Role of the RISC-V Open Standard

Fueling the Cambrian Explosion of DSAs is a quiet but powerful revolution in how processors are designed: the rise of the RISC-V instruction set architecture (ISA).11 An ISA is the fundamental interface between hardware and software, defining the set of instructions a processor can execute. Historically, dominant ISAs like x86 and ARM have been proprietary, requiring expensive licenses to use.11 RISC-V breaks this mold by being a free and open standard, developed collaboratively by academia and industry.11 This has two profound implications for the creation of DSAs:

  1. Open and Royalty-Free: The absence of licensing fees dramatically lowers the financial barrier to entry for designing a custom chip. Startups, research institutions, and even larger companies can now develop bespoke processors without the multi-million-dollar upfront cost of a proprietary ISA license.11
  2. Modular and Extensible: RISC-V is intentionally designed to be simple and modular. It features a small, standard base integer instruction set with a wide range of optional standard extensions (for multiplication, floating-point, etc.).11 Crucially, it also provides a framework for adding custom, non-standard instructions. This allows designers to create highly specialized processors that include only the logic necessary for their target domain, and even to embed custom instructions that directly accelerate their most critical algorithms.11
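
The base-plus-extensions idea can be illustrated with a toy interpreter. Everything below is invented for exposition: the mnemonics, the dispatch-table mechanism, and the custom macc instruction are hypothetical and are not real RISC-V encodings; a real extension is defined at the instruction-encoding level and implemented in hardware or in an actual ISA simulator. The sketch only shows the shape of the idea: a small base integer set that can be extended with a domain-specific instruction without touching the base.

```python
# Toy model of a modular, extensible instruction set (mnemonics are invented,
# not real RISC-V). A small base integer ISA is merged with an optional
# "custom" extension at run time, mirroring how RISC-V layers optional and
# custom extensions over its base set.

BASE_ISA = {
    "addi": lambda regs, rd, rs, imm: regs.__setitem__(rd, regs[rs] + imm),
    "add":  lambda regs, rd, rs1, rs2: regs.__setitem__(rd, regs[rs1] + regs[rs2]),
    "mul":  lambda regs, rd, rs1, rs2: regs.__setitem__(rd, regs[rs1] * regs[rs2]),
}

# Hypothetical custom extension: a fused multiply-accumulate that a DSA
# designer might add to speed up the dot products at the heart of a workload.
CUSTOM_EXT = {
    "macc": lambda regs, rd, rs1, rs2: regs.__setitem__(rd, regs[rd] + regs[rs1] * regs[rs2]),
}

def run(program, extensions=()):
    """Execute (mnemonic, *operands) tuples on eight integer registers."""
    isa = dict(BASE_ISA)
    for ext in extensions:      # modularity: extensions simply merge into the ISA
        isa.update(ext)
    regs = [0] * 8
    for mnemonic, *operands in program:
        isa[mnemonic](regs, *operands)
    return regs

# Dot product of (2, 3) and (4, 5) using the custom macc instruction.
program = [
    ("addi", 1, 0, 2), ("addi", 2, 0, 3),   # x1 = 2, x2 = 3
    ("addi", 3, 0, 4), ("addi", 4, 0, 5),   # x3 = 4, x4 = 5
    ("macc", 5, 1, 3),                      # x5 += x1 * x3  -> 8
    ("macc", 5, 2, 4),                      # x5 += x2 * x4  -> 23
]
print(run(program, extensions=[CUSTOM_EXT])[5])   # prints 23
```

In hardware, the equivalent move is to reserve opcode space for custom instructions and add the corresponding datapath, so the base ISA and most of the toolchain remain untouched while the hot inner loop of the target domain gains a dedicated instruction.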

This combination of open access and technical flexibility makes RISC-V an ideal foundation for building the next generation of DSAs. It acts as a democratizing force, enabling a far wider range of players to participate in hardware innovation, and it is creating a new, tighter “flywheel” for hardware-software co-design. The end of general-purpose scaling created the demand for specialized hardware; RISC-V provides the means for a diverse ecosystem to meet that demand, leading to the Cambrian Explosion of new architectures. Each of these new hardware designs, in turn, is useless without a corresponding software stack (compilers, libraries, and runtimes) that can program it effectively. This creates a powerful pull for software innovation, as seen with frameworks like Intel’s Lava for its Loihi chip.12 The resulting virtuous cycle, in which open hardware enables new software, which in turn makes the hardware more useful and encourages further hardware innovation, is a fundamental shift. It mirrors the open-source revolution that transformed the software industry, now applied to the world of silicon, and promises an accelerated, community-driven evolution of computing.

  1. The End of the Golden Age: Why Domain-Specific Architectures are …, accessed October 9, 2025, https://medium.com/@riaagarwal2512/the-end-of-the-golden-age-why-domain-specific-architectures-are-redefining-computing-083f0b4a4187

  2. Dennard scaling - Wikipedia, accessed October 9, 2025, https://en.wikipedia.org/wiki/Dennard_scaling

  3. IBM Research’s new NorthPole AI chip, accessed October 9, 2025, https://research.ibm.com/blog/northpole-ibm-ai-chip

  4. The AI Cambrian Explosion: When Machines Learned to Think | by Myk Eff | Higher Neurons, accessed October 9, 2025, https://medium.com/higher-neurons/the-ai-cambrian-explosion-when-machines-learned-to-think-56b7de31d364

  5. Building the IBM Spyre Accelerator, accessed October 9, 2025, https://research.ibm.com/blog/building-the-ibm-spyre-accelerator

  6. What’s with the “Cambrian-AI” theme? - Cambrian AI Research, accessed October 9, 2025, https://cambrian-ai.com/whats-with-the-cambrian-ai-theme/

  7. General-purpose computing on graphics processing units - Wikipedia, accessed October 9, 2025, https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units

  8. The Evolution of CPUs and GPUs: A Historical Perspective | OrhanErgun.net Blog, accessed October 9, 2025, https://orhanergun.net/the-evolution-of-cpus-and-gpus-a-historical-perspective

  9. Tensor Processing Unit - Wikipedia, accessed October 9, 2025, https://en.wikipedia.org/wiki/Tensor_Processing_Unit

  10. Understanding TPUs vs GPUs in AI: A Comprehensive Guide - DataCamp, accessed October 9, 2025, https://www.datacamp.com/blog/tpu-vs-gpu-ai

  11. RISC-V - Wikipedia, accessed October 9, 2025, https://en.wikipedia.org/wiki/RISC-V

  12. Intel Advances Neuromorphic with Loihi 2, New Lava Software Framework and New Partners, accessed October 9, 2025, https://www.intc.com/news-events/press-releases/detail/1502/intel-advances-neuromorphic-with-loihi-2-new-lava-software