
Open-Source Vision Models Beat Closed-Source on Benchmarks: The Last Moat Is Desktop Control

InternVL3-78B's 72.2 MMMU exceeds GPT-4o (69.9) and Claude 3.5 Sonnet (70.4). Combined with Qwen 3.5 (87.5 VideoMME) and GLM-5 (77.8% SWE-bench), open-source models achieve parity or superiority on every major benchmark except one: desktop computer-use, where Anthropic's 72.5% OSWorld remains unmatched. This benchmark-by-benchmark parity map reveals the only remaining proprietary moat is embodied agentic capability.

Tags: open-source vision models, InternVL3, Qwen 3.5, benchmark parity, vision-language models | 5 min read | Mar 2, 2026

Key Takeaways

  • Open-source models have achieved parity or exceeded closed-source on text reasoning, code generation, vision-language understanding, and video processing—a systematic shift, not cherry-picked comparisons
  • InternVL3-78B: 72.2 MMMU (exceeds GPT-4o-latest at 69.9 and Claude 3.5 Sonnet at 70.4); InternVL3.5-8B achieves 73.4 MMMU with 10x fewer parameters
  • Qwen 3.5: 87.8 MMLU-Pro (competitive with frontier text), 87.5 VideoMME, 76.4% SWE-bench Verified at $0.48/M tokens (31x cheaper than Claude Opus)
  • GLM-5: 50.4% Humanity's Last Exam (beats Claude Opus 4.5), 77.8% SWE-bench Verified at $0.80-1.00/M tokens under MIT license
  • The ONE benchmark where open-source does NOT achieve parity: OSWorld desktop computer-use, where Claude Sonnet 4.6 leads at 72.5% vs open-source best of 34.5%—a 2.1x gap that requires robotics-adjacent engineering, not just model scale

Benchmark-by-Benchmark Parity Achieved

For the first time in the history of large language models, open-source and open-weight models have achieved parity with, or exceeded, the best proprietary systems on every major standardized benchmark category:

Text Reasoning:

  • Qwen 3.5: 87.8 MMLU-Pro (competitive with GPT-5.2 and Claude Opus 4.6)
  • GLM-5: 50.4% Humanity's Last Exam (exceeds Claude Opus 4.5)

Code Generation:

  • GLM-5: 77.8% SWE-bench Verified
  • Qwen 3.5: 76.4% SWE-bench Verified

Vision-Language Understanding:

  • InternVL3-78B: 72.2 MMMU (exceeds GPT-4o-latest at 69.9 and Claude 3.5 Sonnet at 70.4)
  • InternVL3.5-8B: 73.4 MMMU with roughly 10x fewer parameters

Video Comprehension:

  • Qwen 3.5: 87.5 VideoMME (native video processing via Conv3d patch embeddings)
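Qwen's internals are not public beyond the announcement, but the Conv3d patch-embedding idea itself is simple: the video is cut into non-overlapping temporal-spatial blocks, and each block becomes one token. A minimal numpy sketch, with illustrative patch sizes (the common 2-frame, 14x14 choice, not confirmed values for Qwen 3.5):

```python
import numpy as np

def video_patchify(video, t_patch=2, p=14):
    """Split a video into non-overlapping 3D (temporal x spatial) blocks:
    the reshape a Conv3d patch embedding performs before its linear projection."""
    T, H, W, C = video.shape
    assert T % t_patch == 0 and H % p == 0 and W % p == 0
    x = video.reshape(T // t_patch, t_patch, H // p, p, W // p, p, C)
    # bring the three patch-grid axes to the front, patch contents to the back
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    n_patches = (T // t_patch) * (H // p) * (W // p)
    return x.reshape(n_patches, t_patch * p * p * C)  # one token per 3D block

video = np.random.rand(8, 224, 224, 3)  # 8 frames of 224x224 RGB
tokens = video_patchify(video)
print(tokens.shape)  # (1024, 1176): 4*16*16 blocks, each 2*14*14*3 values
```

A real implementation would follow this reshape with a learned projection (the Conv3d kernel weights); the point is that temporal patching lets a fixed token budget cover many frames.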

The Exception:

  • Desktop Computer-Use (OSWorld): Claude Sonnet 4.6 at 72.5% vs open-source best Simular S2 at 34.5%—a 2.1x proprietary advantage

Open-Source vs Closed-Source Benchmark Parity Map (March 2026)

Systematic comparison showing open-source parity or lead on every benchmark except desktop computer-use

Category                    Best Closed         Best Open-Source     Gap      Parity?
Text Reasoning (MMLU-Pro)   GPT-5.2: ~88        Qwen 3.5: 87.8       <1%      YES
Code (SWE-bench)            Claude: ~75%        GLM-5: 77.8%         +2.8pp   OPEN LEADS
Vision (MMMU)               GPT-4o: 69.9        InternVL3: 72.2      +2.3pt   OPEN LEADS
Video (VideoMME)            Gemini: ~85         Qwen 3.5: 87.5       +2.5pt   OPEN LEADS
Desktop Use (OSWorld)       Claude: 72.5%       Simular S2: 34.5%    -38pp    NO

Source: OSWorld / MMMU / SWE-bench / VideoMME leaderboards + model announcements

The Pricing Inversion: 31x Cheaper, Equal Quality

The standard argument for closed-source AI models has been: 'Pay the premium because open-source cannot match the quality.' That argument is now empirically false across text, vision, video, and code.

Qwen 3.5 at $0.48/M tokens versus Claude Opus at $15/M creates a 31x gap that cannot be justified by quality differences that no longer exist on standardized evaluations.
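The 31x figure is simple arithmetic on the quoted per-million-token prices (the GLM-5 entry takes the midpoint of its stated $0.80-1.00 range):

```python
# Per-million-token prices as quoted in this article (USD).
PRICES = {"Qwen 3.5": 0.48, "GLM-5": 0.90, "Claude Opus": 15.00}

def cost_ratio(cheap, expensive):
    """How many times cheaper the first model is per token."""
    return PRICES[expensive] / PRICES[cheap]

print(f"Qwen 3.5 vs Claude Opus: {cost_ratio('Qwen 3.5', 'Claude Opus'):.1f}x")
print(f"GLM-5 vs Claude Opus:    {cost_ratio('GLM-5', 'Claude Opus'):.1f}x")
```

Per-token price is only one axis, of course; self-hosting an open-weight model trades that API price for hardware and operations cost.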

For vision-language tasks (document analysis, image QA, chart interpretation), the economic case is overwhelming: InternVL3-78B can be self-hosted at a fraction of frontier API pricing while matching or exceeding closed models on MMMU (72.2 vs 69.9 for GPT-4o).

The Critical Distinction: Understanding vs Control

The InternVL3-78B result illustrates both the achievement and the limitation of benchmark parity. InternVL3 scores 72.2 on MMMU, beating GPT-4o's 69.9. But MMMU tests static image understanding—diagrams, charts, documents. It does not test whether a model can USE a computer: clicking buttons, filling forms, navigating multi-tab workflows, operating legacy software through visual interfaces.

That capability—desktop computer-use—is what OSWorld measures. Claude's 72.5% represents a 2.1x lead over ANY other system, proprietary or open-source.

The architectural reasons matter. InternVL3's strength comes from native multimodal pre-training with Variable Visual Position Encoding (V2PE) and Mixed Preference Optimization (MPO). These innovations excel at visual UNDERSTANDING—comprehending what is in an image, reading charts, solving visual math.
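The published idea behind V2PE (from the InternVL line of work) is that visual tokens consume fractional position increments rather than a full step, so dense image sequences do not exhaust the position range of the context window. A minimal sketch, with the increment value chosen purely for illustration:

```python
def v2pe_positions(token_types, visual_delta=0.25):
    """Assign positions: text tokens advance by 1, visual tokens by a smaller
    fractional increment, so long runs of visual tokens stay compact in
    position space (the core idea of Variable Visual Position Encoding)."""
    pos, positions = 0.0, []
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else visual_delta
    return positions

seq = ["text", "text"] + ["visual"] * 4 + ["text"]
print(v2pe_positions(seq))  # [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0]
```

With a standard encoding, the same sequence would occupy positions 0 through 6; here the four image tokens span only one position unit, leaving room for far more visual context.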

Anthropic's Vercept technology excels at visual CONTROL—identifying UI elements, predicting interaction outcomes, executing multi-step desktop operations. This is closer to robotics (Ross Girshick's background is in object detection for robotic manipulation) than to vision-language modeling. Understanding and control are different capabilities with different training requirements.

The Ecosystem Compounding Advantage

InternVL3 uses Qwen2.5-72B as its language backbone, meaning InternVL3 is built ON TOP of the Qwen ecosystem. One lab's language model serves as another lab's vision-language foundation. This ecosystem-level composability—no Western closed-source model serves as a component in another lab's architecture—creates faster compounding.

When one lab's innovation becomes another lab's foundation, the entire system moves faster than any single lab can move alone. This is a structural advantage that compounds over time.

The Efficiency Curve Favors Edge Deployment

InternVL3.5-8B achieves 73.4 MMMU—exceeding the 78B version—with just 8 billion parameters. Qwen3.5-35B-A3B activates only 3 billion parameters while outperforming previous-generation 235B models. The efficiency curve means that frontier-quality vision-language understanding will run on smartphones and edge devices within 12 months.
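The edge-deployment argument is mostly about how many parameters are touched per token. A rough weight-memory estimate (fp16/bf16 weights at 2 bytes per parameter, ignoring KV cache and activations; model sizes as quoted in this article, not independently verified):

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate memory for the parameters read per token at fp16/bf16.
    Ignores KV cache, activations, and any quantization."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

models = [
    ("Qwen3.5-35B-A3B (3B active)", 3),   # MoE: only active experts are read
    ("InternVL3.5-8B", 8),
    ("InternVL3-78B", 78),
]
for name, active_b in models:
    print(f"{name}: ~{weight_memory_gb(active_b):.0f} GB of weights per token")
```

By this crude measure an 8B-class or 3B-active model is within reach of high-end phones and laptops (especially with 4-bit quantization, which would quarter these numbers), while a dense 78B model is not.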

Desktop computer-use cannot follow the same efficiency trajectory because it requires real-time multi-step interaction with low latency on complex workflows—a capability that favors cloud-deployed proprietary infrastructure.

This asymmetric erosion explains why Anthropic's OSWorld lead persists: it requires continuous cloud infrastructure that open-source ecosystems have not needed to optimize for.

The Security Dimension: Open-Source Liability vs Vendor Accountability

The pattern of vulnerabilities reported in workflow orchestrators such as n8n shows that agentic AI infrastructure is only as secure as its weakest integration. As open-source models become capable of powering agents through workflow orchestrators, the security exposure increases. An InternVL3-powered vision agent processing sensitive documents through a compromised n8n workflow has the same blast radius as any other agent.

But with MIT-licensed open-source components, there is no vendor to hold accountable for security incidents. This creates a risk-governance tradeoff: open-source offers cost advantages but reduces vendor accountability for security failures.

The Governance Asymmetry

The Pentagon's ban on Anthropic while Chinese open-source models face zero US oversight creates a regulatory vacuum. InternVL3-78B under Apache 2.0 license can be deployed in any US enterprise with no federal review. Claude, the most capable agent, navigates a governance minefield.

The 'most regulated' option is the most capable one. The 'least regulated' option is the cheapest. This governance asymmetry creates an incentive structure where regulated industries may choose open-source simply to avoid federal scrutiny.

The Contrarian Case: Benchmarks Are Not Production

MMMU, SWE-bench, and VideoMME test specific capabilities under controlled conditions. Real-world production quality depends on instruction following, safety, hallucination rates, long-context reliability, and integration quality—dimensions where closed-source models may retain advantages that benchmarks do not capture.

Additionally, InternVL3's self-reported benchmarks are not all independently verified. The 'open-source beats closed-source' narrative is partly an artifact of which benchmarks are selected for comparison. Cherry-picking favorable benchmarks distorts the narrative.

True production quality requires testing beyond standard benchmarks: latency, throughput, cost per successful inference, error handling, recovery from failures, and integration stability. Benchmarks are leading indicators, not guarantees.
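One of those production metrics is worth making concrete: cost per successful inference. Failed calls still burn tokens, so a cheaper model with a lower task success rate can narrow, or even erase, its sticker-price advantage. The numbers below are illustrative only, not measured success rates for any model:

```python
def cost_per_success(price_per_m_tokens, avg_tokens, success_rate):
    """Effective cost of one *successful* call: failed attempts still
    consume tokens, so cost scales inversely with success rate."""
    cost_per_call = price_per_m_tokens * avg_tokens / 1e6
    return cost_per_call / success_rate

# Hypothetical: a $0.48/M model at 80% task success vs a $15/M model
# at 95%, on 20k-token agentic tasks.
open_src = cost_per_success(0.48, 20_000, 0.80)
closed = cost_per_success(15.00, 20_000, 0.95)
print(f"open: ${open_src:.3f}  closed: ${closed:.3f}  ratio: {closed / open_src:.1f}x")
```

Under these assumptions the effective gap is about 26x rather than 31x; with a larger reliability gap on multi-step tasks, the per-token price advantage shrinks further.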

What This Means for Practitioners

For vision-language understanding tasks: Open-source models are now the rational default. The cost savings are 15-31x with no quality penalty on standardized evaluations. InternVL3-78B and Qwen 3.5 are production-ready under Apache 2.0 / MIT licenses.

For text reasoning and code generation: GLM-5 and Qwen 3.5 provide frontier-equivalent performance at massive cost savings. Default to open-source unless your workload requires vendor SLAs or specific compliance certifications.

Reserve proprietary model spend for two use cases only:

  1. Desktop computer-use agents: Anthropic at 72.5% OSWorld is the only option. No open-source alternative exists. This is the only benchmark category where proprietary models still hold a clear lead.
  2. Safety-critical applications: When you need vendor accountability for security failures or compliance guarantees, proprietary models with vendor SLAs are the rational choice. Open-source licenses disclaim vendor liability.

The 'build on open-source, deploy proprietary agents' pattern will dominate 2026 architecture: Use InternVL3 and Qwen for vision, reasoning, and code tasks. Use Claude for desktop automation. Let the open-source ecosystem handle commodity inference; reserve proprietary spend for irreplicable capabilities and institutional accountability.
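That routing pattern can be sketched as a simple dispatch table; the task taxonomy and fallback choice here are illustrative, not prescriptive:

```python
# Sketch of the "open-source for commodity inference, proprietary for
# desktop agents" routing pattern described above.
ROUTES = {
    "vision_qa": "InternVL3-78B",      # document/image understanding
    "code": "GLM-5",                   # SWE-bench-style coding tasks
    "reasoning": "Qwen 3.5",           # general text reasoning
    "desktop_agent": "Claude Sonnet",  # OSWorld-style computer use
}

def route(task_type: str) -> str:
    """Send commodity inference to open-weight models; reserve proprietary
    spend for capabilities with no open-source equivalent."""
    return ROUTES.get(task_type, "Qwen 3.5")  # cheap open model as fallback

print(route("vision_qa"))      # InternVL3-78B
print(route("desktop_agent"))  # Claude Sonnet
```

In practice the router would also weigh latency, data-residency constraints, and per-task cost, but the structural point stands: only one route in the table requires a proprietary model.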
