Key Takeaways
- GPT-4.5 and Llama 4 score 0% on ARC-AGI-2 despite 88%+ MMLU — benchmarks measure different things than they claim
- ARC-AGI-3 preview shows 100% human success vs. near-0% AI on interactive novel-environment tasks
- Circular data contamination: 74.2% of new web content is AI-generated, with model collapse risk at 1-in-1,000 synthetic samples
- Two-tier benchmark reality: practical benchmarks (OSWorld, SWE-bench) show real capability; abstract benchmarks (MMLU, HumanEval) measure training data coverage
- ARC-AGI-3 launch March 25 will force public reckoning with benchmark legitimacy and reshape model card reporting within 6 months
The Benchmark Divergence: Same Models, Radically Different Scores
Three data points reveal a fundamental fracture in AI evaluation credibility:
Data Point 1: The MMLU-vs-ARC Gap
GPT-4.5 scores 88%+ on MMLU while achieving 0% on ARC-AGI-2. Llama 4 exhibits the same pattern. This is not a marginal difference; it is a categorical contradiction. How can a system be simultaneously near-human on standardized knowledge tests and entirely unable to solve abstract reasoning puzzles?
Data Point 2: ARC-AGI-3 Preview
ARC-AGI-3's interactive environments show 100% human success rate while current frontier models make near-0% efficient progress. This is not edge-case performance — these are environments humans describe as "easy and often fun."
Data Point 3: GLM-5 Self-Reported Metrics
Zhipu claims 94.2% on HumanEval and 50.4% on Humanity's Last Exam with a 34% hallucination rate (down from 90%), yet provides no independent verification. This is benchmark theatre.
What Each Benchmark Actually Tests
MMLU/HumanEval: Pattern Matching
These benchmarks measure how well model outputs match patterns heavily represented in training data. High scores reflect data coverage, not reasoning ability. GLM-5's 28.5-trillion-token training corpus likely contains benchmark-adjacent material picked up through web scraping.
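One way to make the data-coverage claim concrete is to check lexical overlap between benchmark items and a candidate training corpus. Below is a minimal sketch of such a check; the n-gram size, the toy data, and the simple word-level overlap heuristic are illustrative assumptions, not how any lab actually audits contamination.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_items: list, corpus_docs: list, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    A high score suggests benchmark-adjacent material in the training data,
    i.e. scores may reflect memorization rather than reasoning.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0

# Illustrative usage with toy data; a real audit would stream a web-scale corpus.
benchmark = ["Which of the following is the powerhouse of the cell? A) Ribosome B) Mitochondrion"]
corpus = ["The mitochondrion is often called the powerhouse of the cell in textbooks."]
print(f"Contaminated items: {contamination_score(benchmark, corpus, n=4):.0%}")
```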
SWE-bench/OSWorld: Practical Task Completion
These benchmarks require multi-step execution in environments that provide ground truth. Claude's 72.5% OSWorld score reflects genuine agentic capability: operating spreadsheets, browsers, and terminals in sequence. The environment forces the model to expose what it can actually do.
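The structural difference is that the grade comes from verifying the environment's end state rather than matching text. A minimal sketch of that pattern, with a hypothetical scripted agent and a toy file-creation task standing in for OSWorld-style checks:

```python
import tempfile
from pathlib import Path
from typing import Callable

def evaluate_task(agent: Callable[[Path, str], None], instruction: str,
                  check: Callable[[Path], bool]) -> bool:
    """Run the agent in an isolated workspace and grade only the resulting state."""
    with tempfile.TemporaryDirectory() as workdir:
        workspace = Path(workdir)
        agent(workspace, instruction)   # agent performs its multi-step actions here
        return check(workspace)         # ground truth: did the state change correctly?

# Toy task: "save the 2026 budget as budget.csv with a header row".
def check_budget_file(workspace: Path) -> bool:
    target = workspace / "budget.csv"
    return target.exists() and target.read_text().splitlines()[0] == "item,amount"

# A trivial scripted 'agent', used here only to show the harness end to end.
def scripted_agent(workspace: Path, instruction: str) -> None:
    (workspace / "budget.csv").write_text("item,amount\nlicenses,1200\n")

print(evaluate_task(scripted_agent, "Save the 2026 budget as budget.csv", check_budget_file))
```

The agent cannot pass by producing plausible text; either the file with the right header exists at the end or it does not.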
ARC-AGI-2/3: Novel Problem Solving
These benchmarks test abstraction, exploration, and adaptation in environments that cannot be memorized. The 0% score from GPT-4.5 demonstrates that scale and training data coverage do not produce general reasoning.
Self-Reported Metrics: Vendor Grading
Hallucination rates, safety metrics, and other vendor-reported numbers represent a different category — basically self-grading. Without independent validation, they are marketing claims, not measurements.
Frontier model performance ranges from 0% to 94% depending on which benchmark is used, revealing fundamental measurement inconsistency:
| Model | MMLU | OSWorld | ARC-AGI-2 | HumanEval | What It Suggests |
|---|---|---|---|---|---|
| GPT-4.5/5.2 | 88%+ | 38.2% | 0% | ~93% | Scale without reasoning |
| Claude Opus 4.5/4.6 | ~90% | 72.5% | Not reported | 93.8% | Strong practical capability |
| GLM-5 | ~88% | Not reported | Not reported | 94.2% | Benchmark competitive, real-world TBD |
| Llama 4 | ~85% | Not reported | 0% | ~88% | Scale without reasoning |
Source: ARC Prize, Anthropic, OpenAI, Zhipu benchmark releases Feb 2026
Circular Contamination: The Training Data Trap
The synthetic data revolution compounds the benchmark credibility crisis. A 70% reduction in the cost of producing synthetic data, combined with projections that 75% of training data will be synthetic by 2026, creates a circular data problem:
- Model A generates content published to the web
- Model B trains on web data containing Model A's output
- Model B's outputs are published and used in Model C's training
- Each generation compresses the distribution, eliminating tail knowledge
Ahrefs' April 2025 analysis found 74.2% of new web content is AI-generated. This is not a future risk — every model training on web data today is already consuming synthetic data at contamination levels exceeding safety thresholds.
ICLR research shows that even one synthetic sample in 1,000 can trigger strong model collapse, yet web contamination is orders of magnitude higher. This suggests that benchmark scores are being artificially inflated by circular data reinforcement while genuine reasoning diversity is compressed.
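A toy simulation illustrates the compression mechanism. Each generation here samples from the previous model while avoiding its own low-probability tail (a stand-in for truncated decoding), then the next model is fit to that output. The Gaussian setup is a deliberate simplification to show the direction of the effect, not a reproduction of the ICLR results.

```python
import random
import statistics

def sample_model(mu: float, sigma: float, n: int, clip: float = 2.0) -> list:
    """Generate synthetic content; like truncated decoding, rare (tail) outputs are dropped."""
    out = []
    while len(out) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= clip * sigma:   # the model avoids its own tails
            out.append(x)
    return out

random.seed(0)
mu, sigma = 0.0, 1.0                       # generation 0: the human-written distribution
for gen in range(1, 8):
    corpus = sample_model(mu, sigma, 5000)                           # model publishes to the web
    mu, sigma = statistics.fmean(corpus), statistics.stdev(corpus)   # next model fits the web
    print(f"generation {gen}: sigma = {sigma:.3f}")                  # tail knowledge steadily erodes
```

Each pass shrinks the fitted spread by roughly 12%, so after a handful of generations the distribution has lost most of the variation the original data contained.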
ARC-AGI Benchmark Evolution: The Arms Race Against Memorization
How the ARC benchmark series has escalated to stay ahead of scale-driven pattern matching
| Version | Milestone |
|---|---|
| ARC-AGI-1 (2019) | Static grid puzzles; initial AI SOTA ~4%, humans ~84% |
| ARC-AGI-1 (2024) | o1/o3 test-time compute reaches 85%+ via engineered refinement loops |
| ARC-AGI-2 (2025) | Contamination-resistant; GPT-4.5 and Llama 4 score 0% |
| ARC-AGI-2 (2025) | Best score: 24% on private set after 15,154 entries from 1,455 teams |
| ARC-AGI-3 (preview) | Interactive environments: 100% human success, near-0% AI progress |
| ARC-AGI-3 (launch) | 1,000+ levels, 150+ environments, $600K+ prize pool, learning-efficiency metric |
Source: ARC Prize announcements and technical reports 2019-2026
Market Implications: $100B+ Built on Questionable Metrics
Benchmark scores drive capital allocation at scale:
- GLM-5's competitive HumanEval scores contributed to Zhipu's 28.7% stock surge
- Claude's OSWorld trajectory justified the Vercept acquisition and Anthropic's agent strategy
- Apple-Google deal was partially predicated on Gemini's benchmark performance
- Enterprise procurement decisions rely on MMLU, HumanEval, and other easily gameable benchmarks
When ARC-AGI-3 launches March 25 with interactive environments showing near-0% AI performance on tasks humans find easy, it will create a narrative problem. The four frontier labs reporting ARC-AGI scores (Anthropic, Google DeepMind, OpenAI, xAI) face a choice: publish embarrassing scores, or stop reporting ARC-AGI results and invite questions about why.
The Interactive Benchmark Correction
ARC-AGI-3's design uses interactive environments with no instructions, discovery through exploration, and formal learning-efficiency metrics that compare AI to human baselines. By measuring actions per goal rather than raw success rate, it rewards genuine adaptation over memorization.
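ARC Prize has not published a scoring formula here, so the sketch below is only an illustration of the actions-per-goal idea: a hypothetical efficiency ratio of human-baseline actions to agent actions, with unsolved levels scoring zero.

```python
from dataclasses import dataclass

@dataclass
class LevelResult:
    solved: bool
    agent_actions: int            # actions the agent took to reach the goal
    human_baseline_actions: int   # median actions humans needed on the same level

def learning_efficiency(results: list) -> float:
    """Hypothetical efficiency score in [0, 1]; 1.0 means human-level action economy.

    Unsolved levels score 0, so brute-force exploration is not rewarded.
    """
    if not results:
        return 0.0
    scores = [
        min(1.0, r.human_baseline_actions / r.agent_actions) if r.solved and r.agent_actions else 0.0
        for r in results
    ]
    return sum(scores) / len(scores)

runs = [
    LevelResult(solved=True,  agent_actions=120, human_baseline_actions=30),  # solved, but wastefully
    LevelResult(solved=False, agent_actions=500, human_baseline_actions=25),  # never reached the goal
]
print(f"learning efficiency: {learning_efficiency(runs):.2f}")   # prints 0.12 for this toy run
```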
But interactive benchmarks are expensive and complex. The $600K+ prize pool represents substantial investment. Most enterprise AI procurement will continue relying on cheaper, more easily computed benchmarks because interactive evaluation is not yet available at scale.
This creates a gap period (6-18 months) where decision-makers continue using dubious benchmarks while knowing those benchmarks are increasingly unreliable.
The Practical Capability Reality
Despite benchmark theatre, practical capability exists within a bounded envelope:
- Claude 72.5% OSWorld + 94% Pace insurance: Genuine task completion in known domains
- GPT-5.2 87% WebVoyager: Strong web-based automation
- GLM-5 77.8% SWE-bench: Competitive coding capability
These scores reflect real utility — enterprises can deploy these systems for constrained task domains and achieve ROI. The limitation is not that AI is incapable; it is that benchmarks cannot distinguish between "capable at known tasks" and "capable at novel reasoning."
The Contrarian View
The bull case: Practical capability is what matters. Claude's 94% on Pace insurance generates real ROI regardless of ARC-AGI-3 scores. Enterprises do not need AGI — they need reliable task completion, which current models deliver.
The bear case: The gap between practical and reasoning benchmarks may narrow as test-time compute techniques crack ARC-AGI. The 30-day preview is limited data. And the industry's $100B+ capital allocation continues to be driven by metrics of demonstrably decreasing reliability.
What This Means for Practitioners
For model selection: Treat benchmark scores as necessary but insufficient. Prioritize environment-specific evaluation (OSWorld for desktop automation, SWE-bench for coding) over abstract benchmarks (MMLU, HumanEval).
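As a rough illustration of what prioritizing environment-specific evaluation can look like in practice, the weights and benchmark names below are assumptions chosen for the example, not a recommended standard:

```python
# Illustrative weighting only: environment-grounded results dominate,
# abstract benchmarks act as a sanity floor.
DOMAIN_WEIGHTS = {"osworld": 0.45, "swe_bench": 0.35, "mmlu": 0.10, "humaneval": 0.10}

def selection_score(scores: dict) -> float:
    """Weighted average of available benchmark scores (0-100 scale)."""
    used = {k: w for k, w in DOMAIN_WEIGHTS.items() if k in scores}
    total = sum(used.values())
    return sum(scores[k] * w for k, w in used.items()) / total if total else 0.0

candidates = {
    "model_a": {"mmlu": 88, "humaneval": 93, "osworld": 38.2},
    "model_b": {"mmlu": 90, "humaneval": 93.8, "osworld": 72.5},
}
for name, scores in candidates.items():
    print(name, round(selection_score(scores), 1))
```

On these illustrative numbers, two models that look interchangeable on MMLU and HumanEval separate sharply once the environment-grounded score dominates.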
For procurement: Ask vendors for practical task completion metrics in your domain, not general reasoning scores. If they only cite MMLU and HumanEval, they are hiding their actual performance on your problems.
For capability planning: Expect ARC-AGI-3 results to trigger re-evaluation of 'reasoning' claims across the industry within 3-6 months of March 25 launch. Factor this into your model selection timeline.
Outlook: The Evaluation Dark Age
We are entering an evaluation dark age in which no single benchmark credibly measures what matters, and the industry's capital allocation continues to be driven by metrics of decreasing reliability. The correction will come when ARC-AGI-3 launches and exposes the reasoning gap publicly.
The labs that win will be those transparent about benchmark limitations and focused on practical task completion rather than abstract reasoning claims.