Key Takeaways
- Gemini 3.1 Pro's ARC-AGI-2 score jumped roughly 2.5x, from 31.1% to 77.1%, in approximately 90 days, a velocity without historical precedent in AI benchmarking
- DeepSeek V4's pre-launch claims (90% HumanEval, 80%+ SWE-Bench) are explicitly internal and unverified, but the market treats them as fact during the verification lag window
- Claude Mythos was evaluated by UK AISI as the sole external body, with zero independent cross-validation from US AISI, EU AI Office, or academic labs
- Labs now openly select evaluation sets — Google omitted domains where Claude and GPT-5.4 lead; each vendor foregrounds its own home-field benchmarks
- Enterprise procurement is shifting from public leaderboards to internal eval harnesses, creating a two-tier evaluation system: visible (noise) and proprietary (signal)
The Benchmark Velocity Crisis
The defining feature of April 2026's evaluation landscape is not which model is best — it is that the measurement infrastructure cannot keep pace with the models it is supposed to measure. Three data points from this week demonstrate the collapse of an orderly benchmarking regime.
First, the velocity inflection. Gemini 3.1 Pro's ARC-AGI-2 score jumped from Gemini 3 Pro's 31.1% (November 2025) to 77.1% (February 2026) — a 46 percentage-point absolute gain, 2.5x relative, in approximately 90 days. No historical benchmark in AI has moved this fast. MMLU progressed from ~45% (GPT-3) to ~90% (GPT-4) over roughly 3 years; SWE-Bench advanced from single digits to ~80% over ~18 months. When a benchmark doubles in a quarter, either genuine capability inflection has occurred or the benchmark has saturated — and there is no external body with the compute budget or task-authorship velocity to distinguish between the two.
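To make that comparison concrete, here is a back-of-the-envelope velocity calculation in Python using only the figures quoted above; the MMLU and SWE-Bench start points, end points, and timelines are rough approximations, not precise measurements, and "single digits" for early SWE-Bench is treated as 5%:

```python
# Rough benchmark-velocity comparison using the approximate figures cited above.
# Each entry is (start_score, end_score, months); all values are approximations.
benchmarks = {
    "MMLU (GPT-3 -> GPT-4)":        (45.0, 90.0, 36),   # ~3 years
    "SWE-Bench (first ~18 months)": (5.0, 80.0, 18),    # single digits -> ~80%
    "ARC-AGI-2 (Gemini 3 -> 3.1)":  (31.1, 77.1, 3),    # ~90 days
}

for name, (start, end, months) in benchmarks.items():
    velocity = (end - start) / months   # absolute percentage points per month
    relative = end / start              # relative improvement
    print(f"{name}: {velocity:.1f} pts/month, {relative:.1f}x relative")
```

Even with generous error bars on the historical figures, the ARC-AGI-2 jump comes out roughly an order of magnitude faster in absolute points per month (about 15 pts/month) than the MMLU trajectory (about 1.3 pts/month).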
This creates a market timing problem for labs. Google's Gemini 3.1 Pro model card foregrounds ARC-AGI-2 (77.1%) and GPQA Diamond (94.3%); DeepSeek's V4 pre-release materials emphasize HumanEval (90%) and SWE-Bench (80%+); Claude Mythos documentation highlights specialized cyber-security tasks (73% expert CTF). Each lab has identified the benchmarks where it leads and created the narrative around them. When benchmark velocity exceeds verification cadence, this selective framing becomes the only signal the market receives.
The Verification Lag as Marketing Window
DeepSeek V4's pre-launch claims illustrate the second dimension of the crisis: verification latency as a competitive advantage. The model's internal numbers — 90% HumanEval, 80%+ SWE-Bench Verified, 97% Needle-in-a-Haystack at 1M tokens — are explicitly flagged as aspirational by sympathetic coverage (NxCode, WaveSpeed). DeepSeek R1's benchmark claims were independently verified only weeks after launch, a window during which the market treated the numbers as fact and NVIDIA's stock reacted accordingly.
Labs have learned that the verification lag IS the marketing window. Headline numbers drive narrative momentum for 4-8 weeks before independent eval infrastructure (LiveBench, Artificial Analysis, Stanford HELM) publishes its own results. By that point, the market has already priced in the capability, and enterprise procurement decisions have been made. This is not new — benchmark cherry-picking has been endemic since GPT-3 — but it has reached the point where the signal value is inverted: which benchmarks a model card DOES NOT REPORT is now more informative than which ones it does.
Monopsony in Safety Evaluation
The third dimension is institutional concentration. Claude Mythos was evaluated by UK AISI — and essentially only UK AISI, given that the Project Glasswing consortium of 40+ entities operates under defensive-use restrictions that preclude publishing independent capability red-teams. AISI reported 73% success on expert CTF tasks and 22-of-32 average steps on TLO network takeover. These numbers have had zero independent cross-validation.
The US AISI, the EU AI Office, and academic labs all lack access to Mythos for independent evaluation. When a single government body is the entire external evaluation corpus for a model Anthropic itself classifies as ASL-4 (extremely dangerous), benchmark legitimacy becomes a question of institutional trust rather than methodology. This mirrors the evaluation monopsony that existed in prior eras: OpenAI's Evals were the only source for GPT-4 capability data until weeks after release. But the scale is larger — Mythos's capability is higher-stakes, and the evaluation body's exclusive access means there is no adversarial review.
Leaderboard Shopping Ends When Signals Decorrelate
SmartScope's April 2026 critique of Gemini 3.1 Pro's '13 out of 16 wins' framing shows labs now openly select their evaluation set. Google omitted creative writing and agentic computer use (where Claude and GPT-5.4 lead, respectively) from its headline comparison. None of this is novel, but leaderboard shopping — selecting whichever eval sets yield the most flattering comparisons — has exhausted its signal value.
When three frontier labs have three completely different benchmark signatures — Gemini 3.1 Pro leading on ARC-AGI-2 and multimodal video; DeepSeek V4 claiming HumanEval and coding; Claude Mythos dominant in cyber-security capability — the leaderboards that are supposed to produce comparable signals have become decorrelated noise. Artificial Analysis's TTS leaderboard (Elo 1,211 for Gemini, ~1,280 for ElevenLabs) remains useful because it is narrow-domain and actively adjudicated. But general-purpose leaderboards like LMSys Arena or HELM have become too slow relative to release cadence to serve as purchasing criteria.
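For readers unused to Elo-scaled leaderboards, the standard logistic Elo formula shows what the roughly 70-point gap quoted above implies in head-to-head comparisons. This is a sketch under the conventional 400-point logistic scale; whether Artificial Analysis's methodology matches it exactly is an assumption, and the ratings are simply the figures cited above:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard logistic Elo: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# TTS leaderboard figures cited above (approximate).
gemini, elevenlabs = 1211, 1280
p = elo_expected_score(gemini, elevenlabs)
print(f"Expected preference rate for Gemini vs. ElevenLabs: {p:.0%}")  # roughly 40%
```

A ~70-point gap corresponds to roughly a 60/40 split in pairwise preferences — a meaningful but not overwhelming margin, and one that stays interpretable precisely because the domain is narrow and actively adjudicated.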
What This Means for Practitioners
The pragmatic response is already emerging in enterprise procurement: teams are building internal eval harnesses tied to their actual workflows, treating public benchmarks as release-note signal rather than purchasing criteria. This is expensive — eval engineering is a specialized skill that junior AI teams lack — which creates a second-order effect: larger enterprises with dedicated eval teams will increasingly pick models against evidence their competitors cannot see.
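As a sketch of what such a harness can look like, the snippet below scores a model callable against a handful of workflow-specific cases. The `EvalCase` and `run_harness` names, the `call_model` callable, and the toy tasks are all placeholders for whatever your own workload and vendor SDK actually require, not a reference to any existing library:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]   # domain-specific pass/fail judgment

def run_harness(call_model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model under test and return the pass rate."""
    passed = sum(1 for case in cases if case.check(call_model(case.prompt)))
    return passed / len(cases)

# Toy cases standing in for real workflow tasks.
cases = [
    EvalCase("Extract the invoice total from: 'Total due: $1,240.50'",
             check=lambda out: "1,240.50" in out or "1240.50" in out),
    EvalCase("Return only the ISO date in: 'Shipped on March 4, 2026'",
             check=lambda out: out.strip() == "2026-03-04"),
]

# `call_model` would wrap whichever vendor SDK or internal gateway you actually use:
# pass_rate = run_harness(call_model, cases)
```

The value is not in the scoring loop itself but in the case set: a few dozen cases drawn from real tickets, documents, or traces will rank models differently than any public leaderboard, and the ranking stays private.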
For ML engineers evaluating frontier models, the implication is clear: do not trust public leaderboards as the source of truth for comparative capability. Benchmark velocity (models improving 2.5x in 90 days) and selective framing (each lab's home-field benchmarks as headline numbers) have made the public evaluation infrastructure a noise channel rather than a signal channel. The durable signal comes from task-specific internal evaluation on your own workload — time-consuming, but the only method that filters out vendor-optimized benchmark selection.
For procurement teams, expect 2026-2027 to bifurcate evaluation into two tiers: public benchmarks as marketing signal, internal evals as decision criteria. Models with substantive external evaluations (Anthropic RSP documentation, OpenAI evals at release) carry some legitimacy, but even these should be treated as baseline assumptions rather than final criteria. The benchmark infrastructure that existed in 2024 — HELM, LMSys Arena, HuggingFace leaderboards — has served its purpose and is now too slow relative to model release velocity to serve as a market-clearing mechanism.