Open-Source Multimodal Parity Arrives: 9B Models Beat Proprietary Giants While Formal Verification Gets 15x Cheaper

Qwen 3.5 9B beats Gemini 2.5 Flash-Lite on Video-MME (84.5 vs 74.6) and GPT-5-Nano on MMMU-Pro (70.1 vs 57.2). Mistral's Leanstral achieves formal verification at 15x lower cost than Claude Sonnet. Open-source dominates specific domains while proprietary retains leads in coding and reasoning.

TL;DR: Breakthrough 🟢
  • Qwen 3.5 9B exceeds Gemini 2.5 Flash-Lite on Video-MME (84.5 vs 74.6) and GPT-5-Nano on MMMU-Pro (70.1 vs 57.2) -- nine-billion-parameter open-source beating proprietary multimodal leaders
  • Multimodal parity is architectural: Qwen's early fusion (native multimodal tokens from training) vs US industry's grafted vision encoders creates permanent differentiation
  • Mistral's Leanstral formal verification at pass@2 FLTEval: 26.3 points for $36 vs Claude Sonnet's 23.7 for $549 -- a 15x cost advantage with 2.6 points (about 11%) better performance
  • Geopolitical specialization: Chinese labs (Alibaba, Zhipu) dominate multimodal, European labs (Mistral) dominate verification, US labs (OpenAI, Anthropic) dominate reasoning and coding
  • Frontier is now multi-dimensional: open-source leads some capability domains, proprietary leads others. Single-model strategies become a source of capability gaps, not simplification
Tags: open-source, multimodal, Qwen, Mistral, formal verification · 4 min read · Mar 22, 2026
Impact: High · Horizon: Short-term

For video/document workflows: deploy Qwen 3.5 or GLM-4.5V under Apache 2.0 (free). For formal verification: Leanstral at 15x cost savings is production-viable. For coding/reasoning: proprietary models (Claude Opus 4.6, GPT-5.4) still lead. Match model to task, not brand.

Adoption: Immediate for multimodal and formal verification. Qwen 3.5 and Leanstral are Apache 2.0 and available now. Self-hosting infrastructure is mature. Enterprise adoption is limited by the compliance documentation gap.

Cross-Domain Connections

Qwen 3.5 9B beats Gemini 2.5 Flash-Lite on Video-MME (84.5 vs 74.6)
GPT-5.4 leads computer use at 75% on OSWorld, surpassing the 72.4% human baseline

Open-source leads multimodal understanding while proprietary leads agentic computer use. Frontier is multi-dimensional, not a single leaderboard.

Leanstral beats Claude Sonnet at 15x lower cost on FLTEval pass@2
Meta's semi-formal reasoning improves patch verification to 93% accuracy

Full code verification spectrum now open-source viable: semi-formal reasoning for daily code review + Leanstral for formal proofs in CI/CD.

GLM-4.5V leads 41 multimodal benchmarks with 3D-RoPE
Mistral Small 4 unifies reasoning + multimodal + coding in 6B active params

Open-source specialization diverging: Chinese labs optimize multimodal perception, European labs optimize verification and efficiency. US open-source gap is strategically significant.


Open-Source Breaks Through: The Multimodal Revolution

March 2026 marks the moment open-source multimodal AI crossed from 'approaching parity' to 'exceeding proprietary' on production-relevant benchmarks. The significance is not that one open-source model beat one proprietary model on one benchmark -- it is the systematic pattern across multiple independent labs, model sizes, and capability domains.

The multimodal convergence is led by Chinese labs. Alibaba's Qwen 3.5 represents a generation-defining architectural shift: 'early fusion' of multimodal tokens (training on trillions of text+image+video tokens from scratch) rather than the US-dominant approach of grafting vision encoders onto text models.

The result: the 9B-parameter Qwen 3.5 scores 84.5 on Video-MME with subtitles, exceeding Google's Gemini 2.5 Flash-Lite (74.6) by nearly 10 points. On MMMU-Pro, the same 9B model scores 70.1 versus GPT-5-Nano's 57.2 -- a 22.5% relative advantage. All Apache 2.0 licensed.
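The quoted margins are easy to sanity-check from the scores reported in this piece:

```python
# Benchmark scores as quoted above.
video_mme = {"Qwen 3.5 9B": 84.5, "Gemini 2.5 Flash-Lite": 74.6}
mmmu_pro = {"Qwen 3.5 9B": 70.1, "GPT-5-Nano": 57.2}

# Absolute margin on Video-MME (points) and relative advantage on MMMU-Pro.
video_margin = video_mme["Qwen 3.5 9B"] - video_mme["Gemini 2.5 Flash-Lite"]
mmmu_advantage = (mmmu_pro["Qwen 3.5 9B"] - mmmu_pro["GPT-5-Nano"]) / mmmu_pro["GPT-5-Nano"]

print(f"Video-MME margin: {video_margin:.1f} points")
print(f"MMMU-Pro relative advantage: {mmmu_advantage:.1%}")
```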

Zhipu AI's GLM-4.5V leads 41 public multimodal benchmarks among open-source models. Its 3D-RoPE positional encoding innovation enables spatial reasoning capabilities that surpass text-first models on 3D understanding tasks. The GLM-4.1V-9B-Thinking variant achieves parity with 72B closed models on STEM, video, and long-document tasks -- an 8x parameter efficiency advantage.

Open-Source vs Proprietary: Benchmark Parity Map (March 2026)

Domain-by-domain comparison showing where open-source leads, matches, or trails proprietary models.

Domain              | Winner             | Open-Source Leader       | Proprietary Leader
Video Understanding | Open-Source        | Qwen 3.5 9B (84.5)       | Gemini 2.5 Flash-Lite (74.6)
Multimodal QA       | Open-Source        | Qwen 3.5 9B (70.1)       | GPT-5-Nano (57.2)
Formal Verification | Open-Source (15x)  | Leanstral ($36/pass@2)   | Claude Sonnet ($549/pass@2)
Coding (SWE-Bench)  | Proprietary        | N/A                      | Claude Opus 4.6 (80.8%)
Computer Use        | Proprietary        | N/A                      | GPT-5.4 (75.0%)
Knowledge Work      | Proprietary        | N/A                      | GPT-5.4 (83% GDPval)

Source: Compiled from Qwen AI, Mistral, OpenAI, Anthropic benchmark data

Formal Verification Gets 15x Cheaper: The Leanstral Moment

Mistral's Leanstral is the first open-source Lean 4 formal verification agent with Apache 2.0 weights. Its production significance: at pass@2 on FLTEval, Leanstral scores 26.3 versus Claude Sonnet's 23.7 -- at $36 versus $549. That is a 15x cost advantage while outperforming on accuracy.

Even at pass@16 where Claude Opus 4.6 wins (39.6 vs 31.9), Leanstral costs $290 versus $1,650 -- a 5.7x advantage. This makes formal verification economically viable in CI/CD pipelines for the first time. Applications include chip design verification, cryptographic proof validation, and safety-critical system certification -- domains where correctness is worth paying for but $1,650 per evaluation is not.
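The cost ratios follow directly from the quoted figures; the CI-workload savings function below is purely illustrative, assuming one verification run costs the per-evaluation prices above:

```python
# Cost per evaluation as quoted above (USD).
PASS_AT_2 = {"Leanstral": 36, "Claude Sonnet": 549}
PASS_AT_16 = {"Leanstral": 290, "Claude Opus 4.6": 1650}

pass2_ratio = PASS_AT_2["Claude Sonnet"] / PASS_AT_2["Leanstral"]        # ~15x
pass16_ratio = PASS_AT_16["Claude Opus 4.6"] / PASS_AT_16["Leanstral"]   # ~5.7x

def annual_ci_savings(runs_per_day: int, open_cost: float, closed_cost: float) -> float:
    """Illustrative yearly savings from substituting the open model
    in a CI pipeline that runs formal verification daily."""
    return runs_per_day * 365 * (closed_cost - open_cost)

print(f"pass@2 cost ratio: {pass2_ratio:.1f}x, pass@16: {pass16_ratio:.1f}x")
print(f"10 runs/day at pass@2 prices: ${annual_ci_savings(10, 36, 549):,.0f}/year saved")
```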

Formal proof is not the only rung on the ladder, either. Meta's semi-formal reasoning paper (arXiv:2603.01896) provides the theoretical bridge: structured reasoning templates improve patch verification from 78% to 93% accuracy and code QA by 9 percentage points without model retraining. Combined with Leanstral's formal verification at 15x lower cost, the full verification spectrum (informal reasoning -> semi-formal -> full formal proof) is now available in open-source at production-viable costs.
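That escalation logic can be sketched as a small gate. `semi_formal_check` and `formal_prove` are hypothetical stand-ins for a semi-formal reasoning pass and a Leanstral proof attempt, and the 0.93 threshold simply echoes the accuracy figure above:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerificationResult:
    verdict: str  # "accept" or "reject"
    stage: str    # which rung of the verification spectrum decided

def verify_patch(patch: str,
                 semi_formal_check: Callable[[str], float],
                 formal_prove: Callable[[str], bool],
                 threshold: float = 0.93) -> VerificationResult:
    """Run the cheap semi-formal pass first; escalate to a full
    formal proof only when confidence falls below the threshold."""
    confidence = semi_formal_check(patch)
    if confidence >= threshold:
        return VerificationResult("accept", "semi-formal")
    proved = formal_prove(patch)
    return VerificationResult("accept" if proved else "reject", "formal")

# Stub checkers stand in for real model calls.
high = verify_patch("patch-a", lambda p: 0.99, lambda p: False)
low = verify_patch("patch-b", lambda p: 0.50, lambda p: True)
print(high.stage, low.stage)  # the low-confidence patch escalates to the formal tier
```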

Geopolitical Lines: Each Region's Labs Excel in Different Domains

The geopolitical dimension is now impossible to ignore. Chinese labs (Alibaba, Zhipu) dominate multimodal. European labs (Mistral) dominate formal verification. US labs (OpenAI, Anthropic) dominate reasoning and coding benchmarks (SWE-Bench 80.8% for Claude Opus 4.6, GDPval 83% for GPT-5.4).

Capability differentiation is emerging along geopolitical lines: each region's AI labs excel in the capability domains that their respective industrial bases prioritize. China's manufacturing economy benefits from visual quality control and document processing (multimodal). Europe's regulatory and engineering tradition benefits from formal verification. The US service economy benefits from coding and knowledge work.

This is not an accident; it is specialization following comparative advantage.

What Benchmarks Miss: Enterprise Readiness and Compliance

The contrarian case: benchmark parity does not equal production parity. Enterprise deployment requires SLAs, support contracts, security audits, and compliance documentation that open-source models do not provide. The proprietary moat is not capability but enterprise readiness -- and this moat is widening as EU AI Act compliance requirements create documentation burdens that no open-source project currently addresses.

The 15x cost advantage of Leanstral evaporates if enterprises must hire compliance teams to deploy it. This is the critical gap between benchmark parity and production adoption: the documentation and governance overhead that accompanies enterprise deployment.

What the benchmarks miss: GPT-5.4's 1.05M token context window and native computer use (75% OSWorld, surpassing human 72.4%) represent capabilities where no open-source model competes. The frontier is not a single point but a surface: open-source leads some dimensions, proprietary leads others.

What This Means for Practitioners

Match model selection to task domain, not brand:

For video understanding, document processing, and OCR workflows: Deploy Qwen 3.5 or GLM-4.5V with Apache 2.0 -- they outperform proprietary alternatives at zero API cost. Self-hosting infrastructure (vLLM, TGI) is mature and supports both models at production scale.

For formal verification in CI/CD: Leanstral at 15x lower cost than Sonnet is production-viable now. Integrate into your verification pipeline immediately -- this is the highest-ROI substitution available in open-source AI.

For coding and agentic computer use: Proprietary models (Claude Opus 4.6, GPT-5.4) still lead. The benchmark gap is real: 80.8% SWE-Bench for Claude vs no open-source equivalent. For mission-critical code generation, this matters.

For knowledge work and reasoning: GPT-5.4 leads at 83% GDPval. If your use case is pure knowledge extraction (research synthesis, document analysis), the proprietary advantage is measurable.
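The guidance above reduces to a routing table. A minimal sketch, with model names taken from this piece and the mapping purely illustrative:

```python
# Domain -> (model, rationale), distilled from the recommendations above.
ROUTES = {
    "video_understanding": ("Qwen 3.5", "open-source leader, Apache 2.0, zero API cost"),
    "document_processing": ("GLM-4.5V", "open-source leader, Apache 2.0, zero API cost"),
    "formal_verification": ("Leanstral", "~15x cheaper than Claude Sonnet at pass@2"),
    "coding":              ("Claude Opus 4.6", "proprietary leader on SWE-Bench (80.8%)"),
    "computer_use":        ("GPT-5.4", "proprietary leader on OSWorld (75.0%)"),
    "knowledge_work":      ("GPT-5.4", "proprietary leader on GDPval (83%)"),
}

def route(task_domain: str) -> str:
    """Pick a model by task domain, not by vendor."""
    model, _rationale = ROUTES.get(task_domain, ("unrouted", "review manually"))
    return model

print(route("formal_verification"))  # Leanstral
```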
