Key Takeaways
- MMLU is officially retired as a differentiator: 88-93% saturation across all frontier models makes the remaining few-point differences meaningless for real-world capability assessment
- Video-MMMU has replaced MMLU as the capability frontier: Google's Gemini 3 Pro leads at 87.6%, with GPT-5.2 Pro at 83% and most competitors at 60-78%—a meaningful spread that reveals genuine capability gaps
- Meta's UniBench finding is critical: scaling model size and data improves perceptual tasks (object detection, scene description) but does NOT improve visual reasoning (spatial understanding, compositional logic, counting)—suggesting transformers may have fundamental architectural limits
- The proprietary-open video gap (13-17 percentage points) is wider than the text gap because video data is a structural moat: YouTube (Google), Instagram (Meta), and Youku (Alibaba) cannot be replicated via web scraping
- GLM-4.1V-9B achieves 72B-equivalent performance through reasoning-augmented training, proving efficiency matters more than raw parameters for multimodal reasoning—but efficiency gains alone cannot close the video data moat gap
The Benchmark Lifecycle Accelerates
MMLU, the benchmark that defined LLM intelligence for three years (2023-2025), is now saturated at 88-93% across all frontier models. Performance differences in this range are within noise: a model scoring 91% is not meaningfully distinguishable from one scoring 89% on real-world tasks. Its multimodal successor, MMMU, has a flaw of its own: researchers found that many MMMU questions can be answered without looking at the associated visual content, meaning the benchmark measured text reasoning, not multimodal capability.
The replacement cycle has produced MMMU-Pro (3,460 questions, 14 disciplines, a 10-choice format in place of MMMU's 4 choices to curb guessing) and Video-MMMU (900 videos, 5 difficulty levels, requiring simultaneous audio, visual, and text reasoning). These successor benchmarks reveal meaningful capability tiers that MMLU's saturation had flattened.
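One way to sanity-check whether a score gap is meaningful is to compare it against binomial sampling noise. The sketch below is illustrative only (real "noise" also includes prompt sensitivity and contamination, which sampling error does not capture): on a 900-question benchmark such as Video-MMMU, a two-point gap falls inside the ~95% sampling-noise band.

```python
import math

def score_se(p, n):
    """Standard error of a benchmark accuracy estimate (binomial)."""
    return math.sqrt(p * (1 - p) / n)

def gap_vs_noise(p1, p2, n):
    """Return (observed gap, ~95% noise band) for two scores on the
    same n-question benchmark, treating each as an independent binomial."""
    se_diff = math.sqrt(score_se(p1, n) ** 2 + score_se(p2, n) ** 2)
    return abs(p1 - p2), 1.96 * se_diff

# 91% vs 89% on a 900-question benchmark
gap, band = gap_vs_noise(0.91, 0.89, 900)
print(f"gap={gap:.3f}, 95% noise band={band:.3f}")  # gap < band: within noise
```

On a larger question set the band shrinks with 1/sqrt(n), which is why bigger successor benchmarks can resolve gaps that smaller ones cannot.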
Benchmark Lifecycle: Saturation and Succession
(Figure: accelerating benchmark saturation cycle from MMLU to Video-MMMU)
- MMLU: the 88-93% range once defined the 'frontier' for LLMs; now saturated across all frontier models and retired as a differentiator
- MMMU: first major multimodal evaluation benchmark
- Video-MMMU: 900 videos, 5 difficulty levels; the 62-87.6% spread provides real differentiation and reveals meaningful capability gaps
Source: Research community consensus, benchmark papers
Google's Structural Advantage in Video Understanding
Gemini 3 Pro leads Video-MMMU at 87.6%, with GPT-5.2 Pro (83%) and Gemini 2.5 Pro (80%) forming a second tier. Most competitors cluster at 60-78%. On MMMU-Pro, Gemini 3 Pro leads at 81%, with GPT-5.2 Pro at 76% and the pack trailing at 62-74%.
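The tier structure above can be made explicit with a simple gap-based grouping: start a new tier whenever the score drop to the next model exceeds a threshold. The threshold and the "rest of pack" entry below are illustrative assumptions, not leaderboard data.

```python
def tier_models(scores, gap=4.0):
    """Split a leaderboard (name -> score in percentage points) into tiers:
    a new tier starts whenever the drop from the previous model exceeds `gap`."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    tiers, current = [], [ranked[0]]
    for prev, cur in zip(ranked, ranked[1:]):
        if prev[1] - cur[1] > gap:
            tiers.append(current)
            current = []
        current.append(cur)
    tiers.append(current)
    return tiers

# Video-MMMU scores from the text; "rest of pack" is an illustrative stand-in
video_mmmu = {"Gemini 3 Pro": 87.6, "GPT-5.2 Pro": 83.0,
              "Gemini 2.5 Pro": 80.0, "rest of pack": 70.0}
for i, tier in enumerate(tier_models(video_mmmu), 1):
    print(f"Tier {i}: {[name for name, _ in tier]}")
```

With a 4-point threshold this reproduces the grouping described above: Gemini 3 Pro alone, a second tier of GPT-5.2 Pro and Gemini 2.5 Pro, then the pack.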
Google's advantage is architectural, not accidental. Gemini was designed as a natively multimodal model from inception (December 2023), processing image, video, audio, and text through unified attention from the ground up. Other frontier models (GPT-4o, Claude, Llama 4) added multimodal capability through post-hoc integration — fine-tuning text models on visual data. The native vs retrofitted distinction compounds over generations.
Additionally, Google's proprietary TPU infrastructure uses different memory subsystems than NVIDIA GPUs. This architectural difference matters in the context of the HBM shortage: Google can train video-intensive multimodal models on TPUs without competing for HBM3E allocation, while NVIDIA-dependent labs face $400k+/rack premiums and 6-12 month HBM lead times.
Video-MMMU Benchmark: Frontier Model Rankings (March 2026)
(Figure: performance comparison showing Google's structural lead in video understanding)
Source: Awesome Agents leaderboard, benchmark papers
The Open-Source Gap Is Wider in Video Than Text
The proprietary-open gap on MMMU-Pro is 13-17 percentage points (Gemini 3 Pro 81% vs Llama 4 Maverick VL 64% and DeepSeek VL 3 62%). On text-only benchmarks this gap has largely closed: open-source models routinely match proprietary ones on MMLU, HumanEval, and SWE-bench.
The video gap persists because multimodal training requires massive curated image-video-text datasets. Google has YouTube (500+ hours uploaded per minute), Google Search Images, and internal data pipelines. Meta has Instagram and Facebook video. Open-source labs lack equivalent data at scale. This is a structural advantage, not a solvable engineering problem — you cannot replicate YouTube with better training recipes.
Qwen3-VL-235B (Alibaba) is the notable exception: it rivals Gemini 2.5 Pro and GPT-5 on many multimodal benchmarks, benefiting from Alibaba's own video data (Youku) and massive compute infrastructure. The Chinese open-source multimodal story is stronger than the Western open-source story because Chinese tech giants have data + compute + incentive to open-source.
Meta's UniBench Finding: Scaling Doesn't Fix Reasoning
Meta's UniBench study — testing ~60 vision-language models across 50+ benchmarks — produced the most important research finding in multimodal AI for 2026: scaling model size and data volume improves perceptual capability (object detection, scene description, image classification) but does NOT improve visual reasoning (spatial understanding, compositional reasoning, object counting under adversarial conditions).
Even frontier models fail at tasks like digit recognition, counting objects in cluttered scenes, and understanding spatial relationships between objects. These are tasks that preschool children handle trivially. The implication: the next multimodal breakthrough requires architectural innovation (possibly along AMI Labs' JEPA direction), not just scaling the current approach.
GLM-4.1V-9B from Zhipu AI (Tsinghua-affiliated) demonstrates the inverse: a 9B-parameter model achieving performance comparable to 72B models on STEM and video tasks through reasoning-augmented training. This suggests that training methodology (reasoning chains, test-time compute, structured prompting) matters more than raw parameter count for multimodal reasoning — an efficiency insight consistent with the memory wall forcing function.
Implications for DeepSeek V4 and Competitive Dynamics
DeepSeek V4's expected multimodal capabilities (native video understanding, ultra-HD image comprehension) will be measured against this Video-MMMU/MMMU-Pro frontier. The expected pricing ($0.14/1M tokens) creates an interesting dynamic: even if V4 trails Gemini 3 Pro by 10-15 points on Video-MMMU, the 20:1 cost advantage may make it the practical choice for video-processing workloads where accuracy requirements are below 90%.
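The cost-versus-accuracy tradeoff can be made concrete by normalizing price to expected cost per correct answer. The numbers below mix the article's figures with assumptions: the 20:1 ratio implies a ~$2.80/1M-token frontier price, 75% is an illustrative V4 score 12-13 points behind Gemini 3 Pro, and the tokens-per-task figure is arbitrary.

```python
def cost_per_correct(price_per_mtok, accuracy, tokens_per_task=10_000):
    """Expected dollars spent per *correct* task: per-task cost divided by
    accuracy (wrong answers still cost money). tokens_per_task is illustrative."""
    return (price_per_mtok * tokens_per_task / 1e6) / accuracy

cheap = cost_per_correct(0.14, 0.75)      # hypothetical DeepSeek V4 pricing/score
frontier = cost_per_correct(2.80, 0.876)  # implied 20x price, Gemini 3 Pro score
print(f"${cheap:.4f} vs ${frontier:.4f} per correct task "
      f"({frontier / cheap:.0f}x)")
```

Even after penalizing the cheaper model for its lower accuracy, it stays roughly 17x cheaper per correct result under these assumptions; the calculus flips only when a task hard-requires frontier-level accuracy.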
This represents a strategic shift: when Google holds the perception-quality lead and a data advantage that engineering alone cannot overcome, the competitive entry points for other labs become cost and efficiency rather than head-to-head capability.
What This Means for Practitioners
ML engineers evaluating models for video-processing workloads should benchmark on Video-MMMU and MMMU-Pro, not MMLU. MMLU performance tells you almost nothing about video understanding capability at this point. For cost-sensitive video inference, evaluate GLM-4.1V-9B and Qwen3-VL-235B as alternatives to proprietary models—they may meet accuracy requirements at a fraction of the cost.
Build evaluation pipelines that test visual reasoning (spatial, compositional, counting) separately from perception. Meta's UniBench finding shows that these capabilities scale differently. If your application requires reasoning over video, proprietary models (Gemini 3 Pro, GPT-5.2 Pro) have a genuine advantage that smaller models cannot overcome through scale. If your application is perception-heavy (scene description, object detection), smaller efficient models become viable.
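A minimal sketch of the category-split idea: tag every evaluation item as perception or reasoning and report per-category accuracy, so a strong perception score cannot mask a weak reasoning score. The item tags and results here are hypothetical.

```python
from collections import defaultdict

def split_scores(results):
    """Aggregate accuracy per capability category instead of one blended score.
    results: iterable of (category, correct: bool) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for category, ok in results:
        totals[category][0] += int(ok)
        totals[category][1] += 1
    return {c: correct / total for c, (correct, total) in totals.items()}

# Hypothetical per-item results from an eval run
results = [("perception", True), ("perception", True), ("perception", False),
           ("reasoning", True), ("reasoning", False), ("reasoning", False)]
print(split_scores(results))  # perception ~0.67 vs reasoning ~0.33
```

A blended score over these six items would read ~0.50 and hide the 2x gap between the two capabilities, which is exactly the failure mode the UniBench finding warns about.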
For teams building video understanding systems: recognize that the proprietary-open gap in video is wider and more durable than in text. Chinese teams (with Youku data) and hyperscalers (with YouTube and Instagram access) will maintain their leads. The practical paths for labs without proprietary video data are licensing from the leaders, or competing on efficiency and training methodology, where results like GLM-4.1V-9B suggest reasoning-augmented training can partially compensate for scale.