Key Takeaways
- Epoch AI estimates ~300 trillion tokens of high-quality public text; compute-optimal scaling will exhaust this supply by 2028-2032, depending on scaling assumptions
- MoE architecture cuts data consumption ~6x vs. dense scaling under Chinchilla scaling laws: a 120B MoE with 5.1B active parameters behaves like a ~20B dense model, and therefore needs ~6x fewer training tokens
- Simultaneous MoE adoption by OpenAI (gpt-oss), Google (Gemma 4), DeepSeek, Qwen, Kimi signals all frontier labs privately acknowledged data constraints, not just efficiency preference
- RL post-training (RLHF, GRPO, DPO) compounds efficiency by converting limited pre-training data into capability gains that would otherwise require 3-5x more tokens — emerged from Chinese labs under compute export controls, now adopted widely
- Training data is now the primary bottleneck, not compute. Labs with proprietary data pipelines (Common Crawl partnerships, web crawl infrastructure) have defensible moat that new entrants cannot replicate
The Data Constraint: 300T Tokens vs. Compute-Optimal Consumption
Epoch AI estimates that approximately 300 trillion tokens of high-quality public human text exist online. This is a hard supply ceiling. Under compute-optimal scaling (Chinchilla laws), frontier labs will exhaust this supply by 2028-2032, with the most aggressive scaling assumptions pointing to 2028-2029. The timing is not accidental: the MoE pivots by OpenAI (gpt-oss, August 2025), Google (Gemma 4, April 2026), DeepSeek (V3, R1), Qwen, and Kimi all occurred after frontier labs privately acknowledged data constraints.
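The supply-versus-consumption arithmetic can be sketched in a few lines. The 300T ceiling is Epoch AI's figure; the 2025 baseline consumption and the 2x annual growth rate below are illustrative assumptions for the sketch, not Epoch's published parameters:

```python
# Back-of-envelope exhaustion projection. SUPPLY_T (300T tokens) is
# Epoch AI's estimate; the base-year consumption and growth rate are
# illustrative assumptions, not figures from the source.
SUPPLY_T = 300.0  # trillions of high-quality public text tokens

def exhaustion_year(base_year=2025, base_consumption_t=15.0, growth=2.0):
    """First year in which cumulative frontier consumption exceeds supply."""
    cumulative, year, annual = 0.0, base_year, base_consumption_t
    while cumulative + annual < SUPPLY_T:
        cumulative += annual
        annual *= growth  # consumption grows each year
        year += 1
    return year

print(exhaustion_year())            # 2029 under these assumptions
print(exhaustion_year(growth=1.5))  # slower growth pushes exhaustion out
```

Varying the growth assumption between 1.5x and 2x per year is what produces a 2028-2032 style window rather than a single exhaustion date.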
The data scarcity lens reveals a deeper architectural imperative than inference efficiency. Under Chinchilla scaling laws, a 120B dense model requires approximately 2.4 trillion training tokens for compute-optimal training. A 120B MoE model with 5.1B active parameters behaves computationally like a ~20B dense model — requiring roughly 400 billion training tokens for compute-optimal training. This is a 6x reduction in data consumption per capability unit. When data supply is constrained, MoE architecture is not optional — it is the structural response to approaching exhaustion.
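The 6x figure above follows directly from the Chinchilla rule of thumb of ~20 training tokens per parameter, applied to the dense-equivalent size of the MoE (the ~20B figure for a 120B/5.1B-active model is taken from the text):

```python
# Chinchilla heuristic: compute-optimal training uses ~20 tokens per
# parameter. For an MoE, per-token compute is driven by the *active*
# parameters, so we scale against its dense-equivalent size.
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

def optimal_tokens_b(params_b: float) -> float:
    """Compute-optimal training tokens (in billions) for a dense model."""
    return TOKENS_PER_PARAM * params_b

dense_120b = optimal_tokens_b(120)  # 120B dense model
moe_equiv  = optimal_tokens_b(20)   # 120B MoE, 5.1B active ~ 20B dense (per text)

print(f"120B dense:           {dense_120b / 1000:.1f}T tokens")  # 2.4T
print(f"120B MoE (~20B eq.):  {moe_equiv:.0f}B tokens")          # 400B
print(f"data reduction:       {dense_120b / moe_equiv:.0f}x")    # 6x
```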
Figure: The Training Data Constraint: Supply vs. Consumption. Shows the approaching exhaustion of public training data relative to frontier model consumption rates. Source: Epoch AI Open Models Threshold (2026-03-15); Chinchilla scaling laws.
MoE Adoption as Data Constraint Signal
The pattern is clear in retrospect: the MoE pivot was not driven by optimization papers or efficiency gains (though those are real). It was driven by internal data scarcity forecasts. DeepSeek R1 used RL-based synthetic data generation to overcome the compute constraints imposed by US export controls, and once DeepSeek proved that RL + MoE could match frontier models, Western labs adopted the methodology. Constraint-driven innovation became best practice.
The convergence timing confirms this: OpenAI's gpt-oss (August 2025) and Google's Gemma 4 (April 2026) arrived as open-weight MoE models within eight months of each other. Neither lab chose MoE for efficiency gains alone; both have compute budgets to spare. They chose MoE because data is the constraint.
Figure: MoE Architecture Adoption: From Research to Mainstream (2024-2026). Shows how MoE moved from Mixtral experiment to universal frontier architecture in 24 months, driven by data constraint pressure.
- Mistral proves MoE viability for open models; 2x efficiency vs. an equivalent dense model
- DeepSeek proves RL + MoE can match frontier models under compute constraint; a catalytic demonstration
- Epoch AI quantifies the 300T token supply; 2028-2032 exhaustion projected under current scaling
- OpenAI and Google release open-weight MoE models; the architecture becomes the de facto standard
Source: Mistral AI, DeepSeek, Epoch AI, OpenAI, Google DeepMind
RL Post-Training: Converting Limited Data Into Capability
Reinforcement learning post-training (RLHF, GRPO, DPO) compounds the data efficiency advantage. gpt-oss uses 'high-compute RL' post-training to achieve benchmark parity with much larger models. The mechanism: RL post-training converts limited pre-training data into capability gains that would otherwise require 3-5x more pre-training tokens. You train on less pre-training data, then use RL to refine the model on a narrower, higher-signal dataset (human preferences, synthetic reasoning chains).
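Of the post-training methods named above, DPO has the simplest closed form: the policy is pushed to widen its log-probability margin between a chosen and a rejected response, measured against a frozen reference model. A minimal sketch of the per-pair loss on toy log-probabilities (this illustrates the standard DPO objective, not any lab's actual pipeline):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l       : policy log-probs of chosen (w) and rejected (l) responses
    ref_logp_w / ref_logp_l: frozen reference-model log-probs of the same responses
    beta                   : strength of the implicit KL regularization
    """
    # Implicit reward margin = beta * (log-ratio of chosen - log-ratio of rejected)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss = -log sigmoid(margin); equals log(2) when the margin is zero
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen response relative
# to the reference, so the loss falls below log(2) ~ 0.693
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)
print(f"{loss:.4f}")
```

The appeal for data-constrained training is that each preference pair is a dense supervision signal: one human judgment updates the policy directly, with no reward model and no additional pre-training tokens.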
This approach emerged from pressure: Chinese labs faced US compute export controls and needed to match frontier capability with constrained infrastructure. They developed RL-based synthetic data generation to overcome compute constraints. When the methodology proved effective, Western labs adopted it. The $300B capital wave (OpenAI $122B, Anthropic $30B) is being deployed partly against this data constraint — labs are building synthetic data pipelines (video, robotics, domain reasoning) to extend the pre-training frontier beyond public text exhaustion.
Data Scarcity Drives Domain Specialization
The data wall also explains the explosion of domain-specialized models (documented in parallel analysis). A medical model trained on 50B high-quality medical tokens outperforms a general model trained on 2 trillion mixed tokens on medical tasks. Domain specialization sidesteps the scarcity problem by prioritizing data quality over quantity. As general pre-training data approaches exhaustion, domain-specific high-quality data becomes more valuable — shifting competitive advantage to data curation over data volume.
This creates a secondary moat: labs that established data pipelines (Common Crawl partnerships, web crawling infrastructure, enterprise data licensing) in 2022-2024 have defensible data moats that cannot be replicated. New entrants face a structural disadvantage: the highest-quality public training data has already been consumed. This is the real barrier to entry in frontier model development — not compute (which can be rented), not architecture (which is published), but training data (which is exhaustible and already claimed).
$300B Capital Wave: Implicit Acknowledgment of Data Constraint
The record $300B Q1 2026 VC funding concentrated in frontier labs (OpenAI $122B, Anthropic $30B) reveals the data problem implicitly. If compute were the bottleneck, capital would flow to hardware (NVIDIA, Groq). If architecture were the bottleneck, it would spread across the many labs publishing model research. Instead, capital is flowing to frontier labs to fund synthetic data infrastructure, data licensing deals, and alternative pre-training approaches (video, robotics, synthetic reasoning). The capital allocation reveals the internal constraint.
Dimension Research's pretraining analysis confirms this: frontier model pretraining cost has increased 2.4x annually since 2016. Billion-dollar training runs are already underway, and the current trajectory projects $10-100 billion runs within years. But this capital is increasingly deployed not on parameter scaling but on data infrastructure, the very constraint that parameter scaling reveals.
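The 2.4x annual growth figure is enough to date those milestones by compound growth. The ~$0.1B frontier-run cost in 2023 used as the base below is an illustrative assumption of mine, not a figure from the source:

```python
import math

GROWTH = 2.4  # annual growth in frontier pretraining cost (figure cited above)

def projected_cost_b(base_cost_b: float, years_out: int) -> float:
    """Run cost in $B after years_out years of 2.4x annual growth."""
    return base_cost_b * GROWTH ** years_out

def years_to_reach(target_b: float, base_cost_b: float) -> int:
    """Whole years until the trajectory first crosses target_b ($B)."""
    return math.ceil(math.log(target_b / base_cost_b) / math.log(GROWTH))

# Illustrative base: a ~$0.1B frontier run in 2023 (assumption, not sourced)
print(f"2026 run:  ${projected_cost_b(0.1, 3):.2f}B")   # ~$1.4B
print(f"$10B run:  ~{2023 + years_to_reach(10, 0.1)}")
print(f"$100B run: ~{2023 + years_to_reach(100, 0.1)}")
```

Under that base, billion-dollar runs land in the mid-2020s and $10-100B runs toward the end of the decade, consistent with the "within years" projection.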
What This Means for Practitioners
ML engineers choosing model architectures for 2026-2027 training runs should default to MoE over dense: the 6x data efficiency advantage is decisive when data supply is constrained. If you are training a frontier-scale model, MoE is no longer optional — it is the structural response to known data constraints.
Teams evaluating build vs. buy for domain models should factor data access as the primary competitive advantage. If you can access proprietary domain data (enterprise datasets, specialized corpora), that data moat is more durable than compute access or algorithmic innovation. A team with 50B high-quality domain tokens can build a domain-specialized model that outperforms frontier models on domain tasks — and no competitor can replicate the data advantage without matching your data access.
Finally, understand that the frontier model market is increasingly constrained by data, not compute. The companies that own large, high-quality training datasets (Common Crawl partnerships, web crawling infrastructure, enterprise data deals) have the durable moat. OpenAI's $122B raise is partly an infrastructure bet on data acquisition and synthetic generation. This reflects the harsh reality: compute is a commodity (available via cloud), but data is scarce and defensible.