
Benchmark Credibility Crisis: MMLU Saturated, SWE-bench Gamed, Chinese Models Claim Parity

MMLU saturated at 90%+, GSM8K shows 13% contamination drops, LMArena gaming awards ~100 Elo points per strategic submission. DeepSeek V4's unverified 80%+ SWE-bench claim and GLM-5's HLE 50.4 arrive in a broken evaluation ecosystem. AIRS-Bench (59.3% task completion) reveals the truth.

Tags: benchmarks, evaluation, mmlu, swe-bench, deepseek · 5 min read · Feb 19, 2026

Key Takeaways

  • MMLU is saturated at 90%+ and top leaderboards now exclude it entirely; the three-year era of MMLU-based model comparison is over
  • Data contamination: GSM8K math problems show 13% accuracy drops on contamination-free versions, meaning LLMs are memorizing test distributions rather than generalizing reasoning
  • LMArena gaming: major labs submit 10+ model variants, test privately, and publish only favorable results, gaining roughly 100 Elo points per strategic submission
  • DeepSeek V4 claims 80%+ on SWE-bench Verified from internal testing only (unverified); SWE-bench covers just 12 popular Python repos, all almost certainly in training data
  • GLM-5's HLE score of 50.4 (with tool access) is more trustworthy than the SWE-bench claims because HLE is contamination-resistant, but the "with tool access" qualifier means the score may reflect tool quality rather than reasoning
  • AIRS-Bench reveals the gap: agents achieve a 59.3% valid submission rate on real research tasks versus 80%+ on SWE-bench, a disconnect of more than 20 percentage points between isolated benchmarks and end-to-end reliability

The Three Failures of Current Benchmarks

First: MMLU Saturation

Top models now score 90%+ on MMLU, leading major leaderboards to exclude it entirely. The benchmark that defined model comparison for three years has become uninformative. This follows the exact trajectory of GLUE in NLP (saturated circa 2021) and creates a vacuum in the standard evaluation stack.

Second: Data Contamination

Research on GSM8K math problems found up to 13% accuracy drops on contamination-free versions, meaning models are memorizing test distributions rather than learning mathematical reasoning. The LLM Decontaminator tool (embedding similarity plus GPT-4 judgment) has caught paraphrased test leaks in MMLU, GSM8K, and HumanEval that simpler n-gram matching missed. SWE-bench tests bug-fixing on only 12 popular Python repositories, all almost certainly in training data.
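To see why paraphrased leaks slip past simpler filters, here is a toy word-level n-gram overlap check, the kind of exact-match decontamination the LLM Decontaminator improves on. This is an illustrative sketch with made-up example strings, not the actual tool's implementation:

```python
# Toy n-gram overlap check, illustrating why paraphrased test leaks
# evade exact-match decontamination. Hypothetical sketch only.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(train_doc: str, test_item: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in training data."""
    test = ngrams(test_item, n)
    if not test:
        return 0.0
    return len(test & ngrams(train_doc, n)) / len(test)

test_q = ("Natalia sold clips to 48 of her friends in April "
          "and then sold half as many clips in May")
verbatim_leak = test_q                     # exact copy in the training corpus
paraphrased_leak = ("In April, Natalia sold clips to 48 friends, "
                    "then sold half that many in May")

print(ngram_overlap(verbatim_leak, test_q))     # 1.0 -> flagged
print(ngram_overlap(paraphrased_leak, test_q))  # 0.0 -> missed entirely
```

The paraphrase shares no 8-gram with the original, so exact matching scores it as clean; embedding similarity plus an LLM judge catches it because the semantic content is identical.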

Third: Strategic Gaming

An investigation of LMArena (Chatbot Arena) revealed that major labs were submitting 10+ variants of each model, testing privately, and publishing only the favorable results, gaining approximately 100 Elo points per strategic submission. This is not a fringe practice; it is standard operating procedure.
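The inflation mechanism is plain selection bias, and a Monte Carlo sketch shows its magnitude. The true Elo, noise sigma, and variant count below are illustrative assumptions, not measured LMArena values; the point is that publishing only the best of many noisy measurements reliably inflates the reported score:

```python
# Monte Carlo sketch of best-of-N leaderboard gaming: submit N model
# variants privately, publish only the highest-scoring one.
# All numbers are illustrative assumptions.
import random

def best_of_n_elo(true_elo: float, sigma: float, n_variants: int,
                  trials: int = 20_000, seed: int = 0) -> float:
    """Average published Elo when only the best of n noisy variants is kept."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(true_elo, sigma) for _ in range(n_variants))
    return total / trials

honest = best_of_n_elo(true_elo=1300, sigma=50, n_variants=1)
gamed = best_of_n_elo(true_elo=1300, sigma=50, n_variants=10)
print(f"honest: {honest:.0f}, best-of-10: {gamed:.0f}, "
      f"inflation: {gamed - honest:+.0f}")
```

With these assumed parameters, best-of-10 selection inflates the published Elo by roughly 75-80 points with no change in true capability, the same order of magnitude as the ~100 points reported in the investigation.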

The Chinese Open-Source Intersection

Into this credibility vacuum arrive two major Chinese open-source models with headline claims:

DeepSeek V4 claims 80%+ on SWE-bench Verified, which would put it level with Claude Opus 4.5's leading 80.9%. But the claim comes from internal testing only, and SWE-bench draws from 12 Python repositories likely present in V4's training data. The benchmark-gaming research suggests true performance on novel codebases could be 10-15% lower.

GLM-5 scores 50.4 on Humanity's Last Exam (HLE) with tool access, exceeding Claude Opus 4.5 (43.4) and GPT-5.2 (45.8). HLE was specifically designed as contamination-resistant—it uses novel questions unlikely to appear in training data. However, the 'with tool access' qualifier introduces a confound: the score may reflect tool quality (web search, code execution) rather than pure reasoning. GLM-5's BrowseComp score of 75.9 (#1 open-source) reinforces this pattern—it is strongest on tasks where tool use matters most.

The critical point: HLE is a more trustworthy benchmark than MMLU or SWE-bench precisely because it is designed to resist contamination. GLM-5's HLE score is therefore more credible than DeepSeek V4's SWE-bench claim. But the 'with tool access' qualification means the comparison is not apples-to-apples with models tested without tools.

The AIRS-Bench Signal: End-to-End Reality

Facebook Research's AIRS-Bench provides the most sober evaluation of where AI agents actually stand. Across 20 ML research tasks, agents exceeded human SOTA on only 4 tasks, averaged 24.1% normalized score, and had only 59.3% valid submission rate. This benchmark evaluates end-to-end task completion rather than isolated capability—and the results are dramatically worse than any model-specific benchmark suggests.

The 59.3% valid submission rate is particularly revealing when contrasted with SWE-bench scores above 80%. Models that reportedly solve 80% of coding problems in a controlled benchmark fail to even submit a valid answer 40% of the time on real research tasks. This gap—between benchmark performance and task completion—is the benchmark credibility crisis in a single number.
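Part of this gap is structural: a real research task chains many steps, and per-step success rates compound. The step count and success rate below are illustrative assumptions, not AIRS-Bench measurements, but they show how a benchmark-grade agent can still land near a ~60% end-to-end rate:

```python
# Back-of-envelope sketch: why end-to-end completion rates sit far
# below per-step benchmark scores. Numbers are illustrative
# assumptions, not AIRS-Bench measurements.

def end_to_end_rate(per_step_success: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return per_step_success ** n_steps

# A 95%-per-step agent on a 10-step research task:
print(round(end_to_end_rate(0.95, 10), 3))  # 0.599
```

An agent that looks near-perfect on isolated steps completes only about 60% of ten-step tasks, assuming independent failures; correlated failures or recovery behavior would shift the number, but the compounding effect remains.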

AI Benchmark Reliability Assessment (Feb 2026)

Comparison of major benchmarks by contamination risk, gaming risk, and current diagnostic value.

| Status | Benchmark | Top Score | Gaming Risk | Diagnostic Value | Contamination Risk |
|---|---|---|---|---|---|
| Saturated / Excluded | MMLU | 90%+ | Medium | Low | High |
| Active but questioned | SWE-bench | 80.9% | Medium | Medium | High (12 repos) |
| New / Active | HLE (tool access) | 50.4 | Medium (tool quality) | High | Low |
| New / Active | LiveBench | <70% | Low | High | Low (monthly refresh) |
| New / Active | AIRS-Bench | 24.1% avg | Low | High | Medium (fixed tasks) |

Source: Benchmark papers / LXT Blog / UC Strategies / AIRS-Bench

The Benchmark-to-Reality Gap

Key metrics showing the disconnect between isolated benchmark scores and real-world AI task completion.

  • SWE-bench top score: 80.9% (isolated coding tasks)
  • AIRS-Bench task completion: 59.3% (end-to-end research)
  • GSM8K contamination drop: -13% (clean test version)
  • AI code bug rate vs human: 4x higher (Stanford HAI)

Source: SWE-bench / AIRS-Bench / GSM8K research / Stanford HAI

The Emerging Evaluation Stack

The industry is responding with next-generation benchmarks: LiveBench (monthly refreshes, objective ground truth, current top models below 70%); LiveCodeBench (continuous coding problems from active competitions); ARC-AGI-2 (real-time constraints to resist memorization); METR (time-horizon benchmarks for long tasks). These share a design principle: dynamic evaluation that cannot be gamed by training-data memorization.

The practical implication for developers: benchmark scores from any lab (Western or Chinese) should be treated as marketing claims until independently verified on contamination-resistant evaluations. The 'trust stack' for model evaluation in 2026 should be: (1) LiveBench for general reasoning, (2) LiveCodeBench for coding, (3) HLE for frontier intelligence, (4) AIRS-Bench for agentic task completion. Static benchmarks (MMLU, GSM8K, original SWE-bench) are unreliable.
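One way to operationalize this stance is a simple gating check: treat a model's claims as unverified until independent results exist on every tier of the trust stack. The benchmark names come from the text above; the data structure and thresholds are illustrative assumptions:

```python
# Sketch of the 2026 'trust stack' as a verification gate.
# Benchmark names are from the article; the structure is illustrative.
TRUST_STACK = ["LiveBench", "LiveCodeBench", "HLE", "AIRS-Bench"]

def verification_gaps(independent_scores: dict[str, float]) -> list[str]:
    """Return trust-stack tiers still missing independent results."""
    return [b for b in TRUST_STACK if b not in independent_scores]

# A model with only vendor-reported SWE-bench numbers:
claimed_only = {"SWE-bench (internal)": 80.0}
print(verification_gaps(claimed_only))  # all four tiers missing

# A model with independent scores on every tier:
verified = {"LiveBench": 68.0, "LiveCodeBench": 55.0,
            "HLE": 50.4, "AIRS-Bench": 24.1}
print(verification_gaps(verified))  # []
```

A vendor-reported SWE-bench score alone leaves all four tiers unverified, which is exactly the situation the DeepSeek V4 claim presents.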

Contrarian View: Usage May Be the Best Benchmark

The benchmark crisis may be overstated. Models that score 90%+ on MMLU genuinely know a lot—the saturation reflects real capability, not just memorization. GSM8K contamination research shows 13% drops, but 87% of performance is still real reasoning. More importantly, the practical test is whether models solve user problems, not whether they pass academic benchmarks.

If developers find DeepSeek V4 productive in their daily coding work, the benchmark debate becomes academic. The market will evaluate models through usage, not through tests—and usage data (GitHub Copilot at 4.7M subscribers, Claude Code at 4% of commits) may be the most reliable benchmark of all.

What This Means for Practitioners

For ML Engineers Evaluating Models: Stop relying on MMLU, GSM8K, or single SWE-bench scores. Build internal evaluation suites with fresh, task-specific test cases that mirror your actual workloads. Use LiveBench and LiveCodeBench for general comparison. For coding tasks, measure end-to-end task completion rate (not just correctness on known repositories).
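A minimal internal eval harness can be very small. The sketch below scores end-to-end completion: a case passes only if the agent returns a valid submission and that submission passes a task-specific check. The `run_agent` callable and the toy cases are hypothetical stand-ins for your own model call and workload:

```python
# Minimal internal eval harness sketch: fresh, task-specific cases
# scored on end-to-end completion, not per-step correctness.
# `run_agent` is a hypothetical stand-in for your model/agent call.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalCase:
    task: str                          # prompt describing the task
    check: Callable[[str], bool]       # validates the final artifact

def completion_rate(cases: list,
                    run_agent: Callable[[str], Optional[str]]) -> float:
    """Fraction of cases where the agent returns a valid, passing result."""
    passed = 0
    for case in cases:
        output = run_agent(case.task)  # None models "no valid submission"
        if output is not None and case.check(output):
            passed += 1
    return passed / len(cases) if cases else 0.0

# Toy usage with a stub agent that fails to submit on one case:
cases = [
    EvalCase("add 2+2", lambda out: out.strip() == "4"),
    EvalCase("reverse 'abc'", lambda out: out.strip() == "cba"),
    EvalCase("sum [1,2,3]", lambda out: out.strip() == "6"),
]
stub = {"add 2+2": "4", "reverse 'abc'": "cba"}
print(completion_rate(cases, lambda t: stub.get(t)))  # 2/3
```

Counting a missing submission as a failure, rather than excluding it, is what separates this metric from a raw pass rate and mirrors AIRS-Bench's valid-submission measure.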

For Benchmarking Chinese Open-Source Models: Independently benchmark DeepSeek V4 and GLM-5 on YOUR codebase before committing to deployment. Don't rely on internal testing claims. GLM-5's HLE score is more credible than DeepSeek V4's SWE-bench claim, but you still need to validate tool quality and edge case handling in your specific domain.

For Product Managers: The benchmark credibility crisis has a silver lining: usage metrics (how many developers use your model in production, how often they ask for feature improvements, what error rates occur in real deployments) are more informative than benchmark scores. Invest in production monitoring and user feedback rather than chasing leaderboard positions.

Adoption Timeline: LiveBench and LiveCodeBench available now. AIRS-Bench code on GitHub for research teams. Internal evaluation infrastructure should be a Q2 2026 priority for any team deploying AI in production.
