Key Takeaways
- AMD MI355X achieves 1,031,070 tokens/sec on MLPerf with 95-98% multi-node scaling—40% better tokens-per-dollar than NVIDIA B200
- Anthropic's 3.5GW TPU commitment removes $42B annualized revenue from NVIDIA's addressable market starting 2027
- Gemma 4 E2B runs offline on Jetson Nano; 26B MoE activates only 3.8B params/token, enabling edge deployment
- USC memristor performs matrix multiply natively via Ohm's Law—potential 3-5 year timeline to commercial AI compute disruption
- Market bifurcating into three incompatible worlds: datacenter GPU, custom silicon (TPU), edge-native (MoE + novel hardware)
NVIDIA Faces Simultaneous Margin Pressure and TAM Reduction
AMD's MI355X GPUs achieved 1,031,070 tokens/sec on MLPerf Inference 6.0, with 95-98% multi-node scaling efficiency verified through independent reproduction. Against NVIDIA's B200, AMD matched offline throughput, reached 93% of server performance, and delivered 104% of the B300's throughput in interactive mode. In tokens-per-dollar, AMD delivers 40% better economics than the B200.
This is margin pressure from below: NVIDIA's pricing power erodes if AMD can undercut on inference workloads (typically 70-80% of total datacenter AI compute spend).
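The tokens-per-dollar claim is simple arithmetic once system prices are fixed. A minimal sketch, where the normalized prices are hypothetical placeholders; only the 1,031,070 tokens/sec figure and the 40% advantage come from the article:

```python
# Illustrative tokens-per-dollar comparison. System prices below are
# hypothetical normalized placeholders, NOT quoted vendor figures.

def tokens_per_dollar(tokens_per_sec: float, system_price: float) -> float:
    """Throughput normalized by hardware cost (tokens/sec per dollar)."""
    return tokens_per_sec / system_price

amd_tps = 1_031_070      # AMD MI355X MLPerf result (from the article)
amd_price = 1.0          # normalized system cost (assumption)
nvidia_tps = amd_tps     # B200 matched AMD in offline throughput (article)
nvidia_price = 1.4       # implied by the "40% better tokens-per-dollar" claim

amd_econ = tokens_per_dollar(amd_tps, amd_price)
nv_econ = tokens_per_dollar(nvidia_tps, nvidia_price)
print(f"AMD advantage: {amd_econ / nv_econ - 1:.0%}")  # prints "AMD advantage: 40%"
```

At equal throughput, a 40% tokens-per-dollar advantage is equivalent to the competitor costing 1.4x as much per unit of work, which is the inversion the snippet makes explicit.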
Simultaneously, Anthropic's 3.5GW TPU commitment through Broadcom, announced April 6, 2026, removes the largest frontier lab buyer from NVIDIA's addressable market. Mizuho analyst estimates place Broadcom's TPU revenue at $42B annually by 2027. That $42B flows to Google-custom silicon, not NVIDIA GPUs.
The combination is more threatening than either alone: AMD sets the price ceiling on inference, TPU removes the largest buyers. For NVIDIA, this is a two-front squeeze.
[Chart: NVIDIA Datacenter GPU Market Position. Key metrics showing NVIDIA's declining but still dominant market position. Source: SemiAnalysis, Mizuho analyst estimates, AMD MLPerf submission]
Three Incompatible Compute Paradigms Are Emerging
Paradigm 1: Datacenter GPU (NVIDIA/AMD)
Optimization target: Max flexibility + throughput. Works for general training (any model, any architecture) and general inference (any model). NVIDIA dominates training (95%+ market share); AMD competes on inference economics. Lock-in: CUDA ecosystem, custom kernels for new architectures. Cost: Per-GPU procurement + training infrastructure.
Paradigm 2: Custom Silicon (Google TPU / Broadcom)
Optimization target: Workload-specific efficiency on fixed architectures. Anthropic's workload is known: transformer inference at massive scale, serving 1000+ customers. TPU architectures are optimized precisely for this: massive batch inference, systolic array matrix multiply, high interconnect bandwidth for all-reduce. Lock-in: Google Cloud / Broadcom supply agreements, infrastructure-scale contracts. Cost: Long-term supply agreements ($42B+ commitments), not per-GPU purchasing.
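The systolic-array matrix multiply that TPUs are built around can be sketched in software. The toy simulation below illustrates the dataflow only (skewed operands marching through a grid of multiply-accumulate cells, one shift per cycle); it is not Google's actual MXU design:

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    A is m x k, B is k x n. Each processing element (PE) at grid position
    (i, j) accumulates one output C[i][j]. Rows of A enter from the left
    and columns of B from the top, each skewed by one cycle per position,
    so matching operands meet at the right PE at the right time.
    """
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    a_reg = [[0.0] * n for _ in range(m)]  # horizontal pipeline registers
    b_reg = [[0.0] * n for _ in range(m)]  # vertical pipeline registers
    for t in range(m + n + k):             # enough cycles to drain the array
        # Shift operands right / down (reverse order avoids overwrites).
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the array edges (0.0 = pipeline bubble).
        for i in range(m):
            a_reg[i][0] = A[i][t - i] if 0 <= t - i < k else 0.0
        for j in range(n):
            b_reg[0][j] = B[t - j][j] if 0 <= t - j < k else 0.0
        # Every PE multiply-accumulates its current operand pair.
        for i in range(m):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints [[19.0, 22.0], [43.0, 50.0]]
```

The point of the architecture is that operands are reused as they flow between neighboring PEs, so off-chip memory traffic per multiply-accumulate drops sharply, which is exactly the fixed-workload efficiency the paradigm trades flexibility for.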
Paradigm 3: Edge-Native (MoE + Novel Hardware)
Optimization target: Watts-per-inference on edge devices and offline systems. Gemma 4 E2B/E4B variants run offline on Jetson Nano, with 26B MoE model activating only 3.8B parameters per token. MoE routing reduces per-token compute by 75%. USC's memristor chip performs matrix multiply natively via Ohm's Law, bypassing the von Neumann bottleneck. Lock-in: Model-hardware co-optimization (can't easily swap models between edge hardware). Cost: Device amortization over deployment lifetime, not compute pricing.
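The MoE routing that makes 3.8B-of-26B activation possible can be sketched with a toy top-k gate. This is an illustrative router, not Gemma 4's actual implementation; the expert count, `top_k`, and router weights are all assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_w, experts, top_k=2):
    """Route x to the top_k experts by gate score and mix their outputs.

    Only the chosen experts execute, so per-token compute scales with
    top_k rather than the total expert count. That is the mechanism by
    which a 26B-parameter MoE can touch only ~3.8B parameters per token.
    """
    # Gate scores: softmax over a linear router (toy; real routers vary).
    scores = softmax([sum(w * xi for w, xi in zip(row, x)) for row in router_w])
    chosen = sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]
    norm = sum(scores[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)          # only selected experts run
        for d in range(len(out)):
            out[d] += (scores[i] / norm) * y[d]
    return out, chosen
```

Dense layers pay for every parameter on every token; here the unchosen experts cost nothing, which is why the watts-per-inference target of edge deployment favors this architecture.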
Three AI Compute Paradigms: Who Wins Where
Comparison of datacenter GPU, custom silicon, and edge-native compute across key dimensions
| Paradigm | Optimization Target | Key Metric | Cost Model | Lock-in | Best For |
|---|---|---|---|---|---|
| Datacenter GPU (NVIDIA/AMD) | Max throughput/flexibility | 1M+ tokens/sec (AMD MLPerf) | Per-GPU procurement | CUDA ecosystem | General training + inference |
| Custom Silicon (Google TPU) | Workload-specific efficiency | 3.5GW committed capacity | Long-term supply agreement | Google Cloud / Broadcom | Massive-scale API inference |
| Edge-Native (MoE + Novel HW) | Watts-per-inference | 3.8B active params/token | Device cost amortization | Model-hardware co-optimization | Offline, privacy, low-latency |
Source: AMD MLPerf, Broadcom SEC 8-K, Google DeepMind Gemma 4, USC memristor research
These Paradigms Are Incompatible, Not Sequential
The traditional GPU supply chain assumes a single axis of progress: more compute power. You buy a better GPU, run the same software faster, and scale. That holds for NVIDIA vs AMD in the datacenter—both are general-purpose GPUs.
But TPU and edge-native are fundamentally incompatible with this model. AMD's MLPerf submission uses FP4 quantization and ROCm 7 software optimizations, leveraging flexible GPU architecture to match TPU performance on specific workloads. But Google TPU doesn't compete on flexibility—it wins on efficiency for known workloads.
Edge models can't use TPU infrastructure; TPU can't deploy edge-native systems. These are separate markets with separate cost curves, separate lock-in dynamics, and separate customer bases.
NVIDIA Owns Multiple Layers to Stay Relevant
NVIDIA's strategy is to own every layer: datacenter (H100/B200), edge (RTX GPUs, Jetson Orin Nano), and software (CUDA ecosystem). NVIDIA accelerated Gemma 4 for RTX deployment on the day of release, giving Google's model first-class NVIDIA GPU support. This creates lock-in: developers build on Gemma 4 + RTX, then need NVIDIA GPUs for production deployment.
AMD's strategy is to win on datacenter inference economics, the largest single market. AMD doesn't have equivalent edge hardware (no RTX-grade consumer GPU for AI) or software ecosystem (ROCm adoption lags CUDA). AMD's bet: win datacenter, gain leverage to negotiate software partnerships.
This three-way split benefits NVIDIA only if it maintains cross-paradigm relevance—which RTX + Jetson + the CUDA ecosystem currently enables. If the paradigms become truly separate (TPU customers never use NVIDIA, edge deployments use specialized hardware), NVIDIA shrinks to "general GPU vendor" competing on price with AMD.
Novel Hardware Could Disrupt Everything in 3-5 Years
USC's memristor chip operates at 700°C and survives 1B+ switching cycles; TetraMem is commercializing room-temperature variants. Memristors perform matrix multiplication natively using Ohm's Law—no transistor switching, no digital-analog conversion cycles. This is a potential 100-1000x improvement in watts-per-operation for matrix multiply.
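The analog trick is easy to state in code. Below is a toy simulation of an idealized crossbar; real memristor arrays contend with device noise, wire resistance, and negative weights (which typically require differential conductance pairs), none of which is modeled here:

```python
def crossbar_matvec(G, v):
    """Idealized memristor crossbar matrix-vector product.

    Each crosspoint (i, j) holds a memristor programmed to conductance
    G[i][j]. Applying voltage v[i] on row i drives current G[i][j] * v[i]
    through that device (Ohm's law, I = G * V); the output wire of column j
    sums the currents of all devices on it (Kirchhoff's current law), so
    I_j = sum_i G[i][j] * v[i] emerges in a single analog settling step,
    with no clocked multiply-accumulate loop in the hardware.
    """
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * v[i] for i in range(rows)) for j in range(cols)]

print(crossbar_matvec([[1, 2], [3, 4]], [1, 1]))  # prints [4, 6]
```

The loop in the code exists only because software is sequential; in the physical array every product and every sum happens simultaneously in the electrical domain, which is the source of the claimed watts-per-operation advantage.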
A 3-5 year commercialization timeline means TetraMem could have production AI memristor chips by 2029-2030. If so, the architectural assumptions underlying TPU and GPU design become suboptimal. A new hardware paradigm optimized for memristor physics would disrupt all three compute worlds simultaneously.
Current infrastructure bets (Anthropic's 3.5GW, NVIDIA's datacenter dominance, Google's TPU roadmap) assume silicon-based computing is architecturally stable through 2030-2031. Memristor commercialization would shorten that horizon.
What This Means for Practitioners
ML engineers deploying at scale should evaluate compute paradigms by workload type:
1. General training and research: GPUs remain dominant, but evaluate AMD MI355X for inference workloads; 40% better tokens-per-dollar justifies integration testing and the ROCm learning curve.
2. Massive-scale serving (like Anthropic's 1000+ customers): TPU-scale custom silicon is the right fit. General GPUs are overprovisioned.
3. Edge and offline deployment: adopt Gemma 4 MoE variants and target 3-5W inference budgets on edge hardware (Jetson, Qualcomm Snapdragon, MediaTek).
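As a toy summary, that decision logic might be encoded as follows; the workload labels and return strings are illustrative, not a real API:

```python
def recommend_paradigm(workload: str, scale: str = "moderate") -> str:
    """Toy helper encoding the workload-type guidance above (illustrative)."""
    if workload in ("training", "research"):
        # General-purpose flexibility still wins; AMD worth testing for inference.
        return "datacenter GPU (evaluate AMD MI355X for inference-heavy phases)"
    if workload == "serving" and scale == "massive":
        # Fixed transformer-inference workload at scale: custom silicon territory.
        return "custom silicon (TPU-class); general GPUs are overprovisioned"
    if workload in ("edge", "offline"):
        # Watts-per-inference is the binding constraint.
        return "edge-native MoE (e.g. Gemma 4 E2B/E4B) within a 3-5 W budget"
    return "datacenter GPU (default general-purpose choice)"
```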
Organizations locked into NVIDIA should monitor ROCm maturity and AMD's partner ecosystem. HIPIFY (AMD's CUDA-to-HIP transpiler) converts roughly 90% of CUDA code automatically, so migration is feasible but not frictionless. Plan for a 6-12 month migration if you switch GPU vendors.
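Much of what HIPIFY does is mechanical API renaming, which a toy sketch makes concrete. The mappings below are real CUDA-to-HIP renames, but the actual tools (hipify-perl, hipify-clang) also handle headers, kernel-launch syntax, and hundreds of other APIs; this sketch covers only a handful of common calls:

```python
# Toy illustration of HIPIFY-style conversion: mechanical API renaming.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
}

def hipify(source: str) -> str:
    """Rewrite known CUDA API names to their HIP equivalents."""
    # Replace longest names first so e.g. cudaMemcpyHostToDevice is not
    # partially rewritten by the shorter cudaMemcpy rule.
    for cuda_name, hip_name in sorted(CUDA_TO_HIP.items(),
                                      key=lambda kv: -len(kv[0])):
        source = source.replace(cuda_name, hip_name)
    return source

print(hipify("cudaMalloc(&d, n); cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);"))
# prints hipMalloc(&d, n); hipMemcpy(d, h, n, hipMemcpyHostToDevice);
```

The remaining ~10% (custom kernels tuned to NVIDIA warp sizes, inline PTX, CUDA-only libraries) is where the 6-12 month migration estimate comes from.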
For long-term planning (2028+), allocate R&D budget to memristor AI compute research. The 3-5 year timeline means prototype programs should start now to have production-ready implementations when commercial chips arrive.