Hardware-Software Co-Design Is the New Scaling Law: Sonnet 5, NVFP4, and Engram Prove It

Three independent innovations show hardware-aware architecture choices now yield larger capability gains than parameter scaling. Sonnet 5 beats Opus on coding via TPU co-design — at 80% lower cost.

TL;DR
  • Claude Sonnet 5 (82.1% SWE-bench) beats flagship Opus 4.6 (80.84%) through TPU co-optimization alone — no parameter scaling, 80% lower cost. Hardware-software co-design yields larger gains than tier advancement.
  • DeepSeek's Engram offloads 100B parameters to system DRAM with <3% throughput penalty via O(1) hash-based retrieval — the Sparsity Allocation Law (20-25% to memory) delivers 3-5 point benchmark improvements on 27B models.
  • NVFP4 achieves <1% accuracy loss at 4-bit precision via native Tensor Core support — and actually shows +2% AIME improvement over FP8 on DeepSeek-R1 through beneficial regularization.
  • US export controls on China, intended to limit capability, instead accelerated the co-design innovation that Western labs are now adopting — the constraint-driven approach is becoming the industry standard.
  • The new scaling law: Performance = f(Parameters, Data, Training Compute, Hardware Alignment, Precision Optimization, Memory Hierarchy Exploitation)
Tags: hardware-software co-design, scaling laws, Claude Sonnet 5, NVFP4, DeepSeek Engram | 6 min read | Feb 18, 2026

Hardware-Software Co-Design Capability Gains (Without Parameter Scaling)

Three independent co-design approaches each deliver benchmark improvements comparable to 2-4x parameter scaling

  • +1.3 pts: Sonnet 5 vs Opus 4.6 (SWE-bench), at 80% lower cost
  • +3-5 pts: Engram vs baseline (27B model), needle recall 84% to 97%
  • +2%: NVFP4 vs FP8 (AIME 2024), at 1.8x less memory
  • <3%: Engram DRAM offload throughput penalty, with 100B parameters offloaded

Source: Anthropic, DeepSeek/Peking University, NVIDIA Research

The Paradigm Break: Co-Design Beats Scale

For four years (2020-2024), the dominant paradigm in frontier AI was scaling: larger models, more training data, more compute. The Chinchilla scaling laws (2022) refined this to optimal compute allocation between model size and data, but the fundamental approach remained the same — throw more resources at the problem.

February 2026 data reveals a structural break: the most impactful capability improvements are coming from hardware-software co-design rather than parameter scaling. Three independent evidence streams converge on this conclusion.

Evidence 1: Sonnet 5 Beats Opus Via TPU Co-Optimization

Claude Sonnet 5 at 82.1% SWE-bench Verified outperforms Claude Opus 4.6 at 80.84% — a mid-tier model beating the flagship on the most commercially relevant coding benchmark. The mechanism is not more parameters or more training data; it is co-optimization for Google's Antigravity TPU infrastructure, delivering 50% inference cost reduction through hardware-software co-design.

The Manager Agent architecture (specialized Backend/QA/Infrastructure sub-agents) is more constrained than Opus 4.6's general Agent Teams but better optimized for the specific compute patterns of software engineering tasks. This is a targeted co-design choice: rather than building a general model and hoping it performs well on coding, Anthropic built a model whose inference patterns match the hardware's computational strengths.

The pricing implication seals the argument: Sonnet 5 at $3/1M input tokens is 80% cheaper than Opus 4.5 at $15/1M while exceeding its SWE-bench performance. Hardware co-design simultaneously improved capability AND reduced cost — something parameter scaling alone cannot achieve.

Model | SWE-bench | Cost ($/1M input) | Optimization approach
Claude Opus 4.5 | 78.9% | $15.00 | Parameter scaling
Claude Opus 4.6 | 80.84% | $5.00 | General architecture
Claude Sonnet 5 | 82.1% | $3.00 | TPU co-design + Manager Agent
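The cost claims in the table reduce to simple arithmetic, and they are worth sanity-checking. A quick sketch using only the table's numbers (the cost-per-benchmark-point metric is my addition, not Anthropic's):

```python
# Verify the headline cost claims from the table above.
opus45, opus46, sonnet5 = 15.00, 5.00, 3.00  # $ per 1M input tokens

saving_vs_opus45 = (opus45 - sonnet5) / opus45
print(f"Sonnet 5 vs Opus 4.5: {saving_vs_opus45:.0%} cheaper")
# prints: Sonnet 5 vs Opus 4.5: 80% cheaper

# Cost per SWE-bench point: a rough capability-per-dollar view.
for name, score, price in [("Opus 4.5", 78.9, opus45),
                           ("Opus 4.6", 80.84, opus46),
                           ("Sonnet 5", 82.1, sonnet5)]:
    print(f"{name}: ${price / score:.3f} per benchmark point per 1M tokens")
```

On this metric Sonnet 5 wins on both axes at once: it is the cheapest per token and the cheapest per benchmark point.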

Evidence 2: DeepSeek Engram DRAM Offload

DeepSeek's Engram conditional memory architecture (published January 12, 2026 with Peking University) introduces a hardware-software co-design principle: offloading 100 billion parameters of hash-based embedding tables to system DRAM rather than GPU HBM, with less than 3% throughput penalty.

This exploits a hardware insight: modern server DRAM (DDR5, multi-terabyte capacity, hundreds of GB/s bandwidth) is vastly underutilized in GPU-centric AI inference. Transformer attention conflates two computationally distinct tasks:

  1. Retrieving stored patterns (entity names, factual associations) — suited for DRAM via O(1) hash lookup
  2. Dynamic contextual reasoning — requires GPU HBM bandwidth for attention computation

Engram separates these, offloading pattern retrieval to DRAM while keeping reasoning on GPU. The empirically discovered "Sparsity Allocation Law" — 20-25% of sparse parameters to memory — yielded 3-5 point benchmark improvements on 27B test models. Needle-in-a-Haystack recall jumped from 84.2% to 97%. These are capability gains achieved through hardware co-design, not through scaling from 27B to 270B parameters.

The "Compute Liberation Effect" — where early layers in Engram models behaviorally resemble much deeper layers in standard MoE models — demonstrates that hardware-aware architecture design can achieve the effective depth of a larger model without the parameter count.
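The retrieval/reasoning split described above can be sketched in a few lines. This is a toy illustration of the principle, not DeepSeek's implementation: the table size, `bucket_id`, and `retrieve` are invented names, and a NumPy array on the host stands in for pinned system DRAM.

```python
import numpy as np

# Toy Engram-style split: a large hash-bucketed embedding table lives
# in host DRAM (a NumPy array standing in for pinned system memory),
# and lookup is O(1) per token n-gram -- no attention pass over the
# stored patterns. Sizes here are tiny; real tables hold ~100B params.
NUM_BUCKETS = 1_000_000
EMBED_DIM = 64

# "DRAM-resident" table: allocated once on the host, never on the GPU.
dram_table = np.random.default_rng(0).standard_normal(
    (NUM_BUCKETS, EMBED_DIM)).astype(np.float32)

def bucket_id(ngram: tuple) -> int:
    """Deterministic O(1) hash of a token n-gram to a table row."""
    return hash(ngram) % NUM_BUCKETS

def retrieve(ngrams: list) -> np.ndarray:
    """Gather pattern embeddings from host memory for one batch.
    In a real pipeline only this gather result (a few KB per
    sequence) is copied to GPU HBM -- never the table itself."""
    rows = [bucket_id(g) for g in ngrams]
    return dram_table[rows]

# Usage: look up three bigrams; only 3 x 64 floats cross the PCIe bus.
batch = [(17, 42), (42, 7), (7, 99)]
embs = retrieve(batch)
print(embs.shape)  # (3, 64)
```

The point of the sketch is the access pattern: retrieval is a hash plus a gather, with no bandwidth-hungry attention over stored patterns, which is why it tolerates DRAM latency with so little throughput penalty.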

Evidence 3: NVFP4 Native Tensor Core Inference

NVFP4 achieves less than 1% accuracy degradation at 4-bit precision because Blackwell and Rubin Tensor Cores natively process NVFP4 without dequantization overhead. Previous 4-bit formats required converting back to higher precision for computation, losing the memory savings during the critical compute step.

The two-level scaling architecture (E4M3 FP8 micro-block per 16 values + FP32 tensor-level) is tuned to the specific data movement patterns of NVIDIA Tensor Cores. The 16-value block size (versus MXFP4's 32) matches the Tensor Core's warp-level execution, allowing more granular precision adaptation within each computation cycle.
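The two-level scheme is easy to simulate. The sketch below uses the E2M1 magnitude grid and the 16-value block size from the format, but the function names and round-to-nearest policy are illustrative assumptions, and the per-block scale is kept in FP32 rather than E4M3 for simplicity:

```python
import numpy as np

# Simulated two-level NVFP4-style quantization (a sketch, not
# NVIDIA's kernel). FP4 E2M1 can represent these magnitudes:
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                     dtype=np.float32)

def quantize_fp4(x: np.ndarray, block: int = 16):
    """Quantize x (size divisible by `block`) with one scale per
    16-value micro-block plus one tensor-level scale."""
    x = x.reshape(-1, block)
    # Level 1: tensor-wide FP32 scale maps the global max onto 6.0.
    tensor_scale = np.float32(np.abs(x).max() / 6.0 + 1e-12)
    xs = x / tensor_scale
    # Level 2: per-block scale so each block's max hits the FP4 range.
    block_scale = np.abs(xs).max(axis=1, keepdims=True) / 6.0 + 1e-12
    xb = xs / block_scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(xb)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(xb) * E2M1_GRID[idx]
    return q, block_scale.astype(np.float32), tensor_scale

def dequantize_fp4(q, block_scale, tensor_scale):
    return (q * block_scale * tensor_scale).ravel()

x = np.random.default_rng(1).standard_normal(256).astype(np.float32)
q, bs, ts = quantize_fp4(x)
err = np.abs(dequantize_fp4(q, bs, ts) - x).mean()
print(f"mean abs error: {err:.4f}")
```

The finer 16-value blocks matter because a single outlier only inflates the scale of its own block, leaving the other 15 values (rather than 31, as in MXFP4) quantized at full resolution.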

The counterintuitive result: DeepSeek-R1-0528 evaluation shows +2% AIME 2024 accuracy improvement in NVFP4 versus FP8. Lower precision yielding higher accuracy. The likely explanation: NVFP4's regularization effect (quantization noise acting as implicit regularization) benefits certain reasoning chains — a hardware-aware improvement that pure algorithmic research would not discover.

The New Scaling Law

Old scaling law: Performance = f(Parameters, Data, Training Compute)

Emerging co-design law: Performance = f(Parameters, Data, Training Compute, Hardware Alignment, Precision Optimization, Memory Hierarchy Exploitation)

The additional terms are not marginal corrections — they are dominant factors. Sonnet 5's 82.1% versus Opus 4.6's 80.84% is a 1.3-point improvement achieved through hardware co-design alone, without any parameter scaling. DeepSeek's 3-5 point improvement via Engram is achieved through DRAM exploitation. NVFP4's +2% AIME improvement is achieved through precision format design.

A hypothetical model combining all three co-design approaches (TPU-optimized architecture + Engram DRAM offload + NVFP4 precision) could plausibly deliver a 5-10 point benchmark improvement over a brute-force scaled baseline — equivalent to the gain from 2-4x parameter scaling, at a fraction of the cost.

The Export Controls Paradox

For Chinese labs operating under US export controls, hardware-software co-design is not optional — it is the only path to competitive performance given constrained GPU access. DeepSeek's Engram, Sparse Attention, and mHC innovations are all co-design responses to hardware ceilings imposed by H800 constraints.

The constraint that was supposed to limit Chinese AI capability has instead accelerated architectural innovation that Western labs are now adopting. Export controls created a pressure cooker that produced DeepSeek V4's Engram architecture and Qwen 3.5's 95% activation memory reduction. These innovations are now being adopted by NVIDIA (Nemotron 3 optimized for Rubin) and Anthropic (Sonnet 5 TPU co-optimization). The intended limitation became an innovation accelerant for the entire industry.

What This Means for ML Engineers

  1. Profile your model's compute patterns against target hardware before architecture decisions. Understand your inference hardware's strengths: TPU instruction throughput, GPU Tensor Core warp size, DRAM bandwidth utilization. Co-design your model architecture to match. The days of "train bigger and hope" are over for teams that want to compete on cost efficiency.
  2. Adopt NVFP4 quantization immediately on Blackwell hardware. The <1% accuracy degradation and 3.5x memory reduction make this a no-brainer for production deployments on current NVIDIA GPUs. Implement NVFP4 KV cache quantization for 50% memory reduction on long-context applications.
  3. Consider DRAM offload for embedding-heavy models. If your model has large embedding tables for entity representations or factual knowledge retrieval, the Engram principle (offload hash-based retrieval to DRAM, keep reasoning on GPU) can yield significant cost savings. The O(1) lookup characteristic is key — this works for static knowledge, not dynamic reasoning.
  4. Track hardware partnerships as a capability signal. Labs with deep hardware partnerships (Anthropic-Google TPU, DeepSeek-H800 expertise, NVIDIA-Nemotron-Rubin) are accumulating co-design expertise that compounds over generations. When evaluating a lab's future capability trajectory, their hardware relationship is now as important as their research team quality.
  5. Labs focused on parameter scaling without hardware optimization face diminishing returns. The Rubin generation will deliver 10x cost reduction — but only for models designed to exploit NVFP4, HBM4 bandwidth, and NVLink interconnect characteristics. Models that are hardware-agnostic will underperform models co-optimized for the hardware.
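The 3.5x figure in point 2 follows directly from the format's storage layout; a back-of-envelope check (the 70B model size is illustrative):

```python
# NVFP4 stores 4 bits per value plus one FP8 scale per 16-value block:
# 4 + 8/16 = 4.5 bits per parameter (tensor-level FP32 scales are
# negligible). That reproduces the ~3.5x reduction vs FP16 cited above.
PARAMS = 70e9
GB = 1024**3

def weight_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / GB

fp16 = weight_gb(16)
nvfp4 = weight_gb(4 + 8 / 16)
print(f"FP16: {fp16:.0f} GB, NVFP4: {nvfp4:.0f} GB, "
      f"ratio {fp16 / nvfp4:.2f}x")  # ratio prints as 3.56x
```

The same arithmetic explains the KV-cache claim: halving cache precision from FP8 to FP4-with-scales gives roughly the 50% reduction cited for long-context workloads.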

The shift from "train bigger" to "design smarter" favors technically deeper organizations. Hardware-software co-design expertise is now a tier-1 competitive differentiator alongside model architecture and training data quality.
