Key Takeaways
- Claude's OSWorld score: roughly 4.9x improvement in 16 months (14.9% to 72.5%), averaging ~14 percentage points every 4 months. If the trend holds, the trajectory suggests human-level performance by Q4 2026
- Sonnet 4.6 achieves 94% on real insurance workflows, the first production-grade vertical result for AI desktop automation; UiPath shares fell 3.6% on the Vercept announcement
- Cost economics reversed: Sonnet 4.6 at ~$0.30/$1.50 per million input/output tokens with caching is 10-50x cheaper than UiPath RPA licensing ($420-$1,680/robot/month)
- Anthropic's acqui-hire strategy (Vercept + Bun in 3 months) signals a 6-12 month product timeline, not a research timeline
- Dual improvement curve (higher accuracy + lower cost simultaneously) is structurally different from RPA's static cost model
The 16-Month Acceleration: From Research Curiosity to Production-Grade
Claude's computer use capability has improved at a rate that, if sustained, would make enterprise RPA economically obsolete within 12-18 months. The OSWorld benchmark data shows the trend:
October 2024 (14.9%) → February 2025 (28.0%) → June 2025 (42.2%) → October 2025 (61.4%) → February 2026 (72.5%). This is roughly a 4.9x improvement in 16 months. Each 4-month interval adds about 14 percentage points on average, although individual intervals range from 11.1 to 19.2 points. At this rate, Claude would reach approximately 87% by June 2026 and approach human-level performance (estimated at ~95%) by October 2026.
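The arithmetic behind that trajectory can be checked with a short sketch. It uses only the five scores quoted above and assumes a naive linear trend, which is a simplification: benchmark scores cannot exceed 100% and gains usually slow near the ceiling. The `project` helper is a hypothetical name introduced for illustration.

```python
# OSWorld scores quoted in the text, in 4-month steps from October 2024.
scores = [14.9, 28.0, 42.2, 61.4, 72.5]

# Percentage-point gain in each 4-month interval.
deltas = [round(b - a, 1) for a, b in zip(scores, scores[1:])]
avg_gain = sum(deltas) / len(deltas)
ratio = scores[-1] / scores[0]


def project(intervals_ahead: int) -> float:
    """Naive linear projection, capped at 100 because a score cannot exceed it."""
    return min(100.0, scores[-1] + avg_gain * intervals_ahead)


print(deltas)                # [13.1, 14.2, 19.2, 11.1]
print(round(avg_gain, 1))    # 14.4
print(round(ratio, 2))       # 4.87
print(round(project(1), 1))  # 86.9  (June 2026)
print(round(project(2), 1))  # 100.0 (October 2026, capped)
```

The spread in the interval gains (11.1 to 19.2 points) is a reminder that the projection is sensitive to which intervals are averaged.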
But OSWorld is a research benchmark. The Pace insurance result (94% accuracy on real insurance workflows) is a production signal demonstrating that AI computer use has crossed from "interesting demo" to "production-deployable."
Insurance: The Production Signal
Insurance is a $7.1 trillion global industry where 30-40% of back-office tasks are still manual or semi-automated via RPA. Claude Sonnet 4.6 achieves 94% accuracy on Pace's insurance computer use benchmark, which covers navigating spreadsheets, filling multi-step web forms, interacting with legacy desktop applications, and handling First Notice of Loss (FNOL) intake. These are real workflows, not synthetic benchmarks. The 94% result signals that AI has transitioned from prototype to production readiness in at least one regulated vertical.
The Acqui-Hire Acceleration: Vercept and Bun
Anthropic's $50M acquisition of Vercept brings Ross Girshick (R-CNN inventor, computer vision pioneer), Luca Weihs (AI2 reinforcement learning), and Kiana Ehsani (perception research) in-house. This is the second major acqui-hire in three months — Anthropic acquired Bun's development team in December 2025. Both acquisitions target the agentic infrastructure stack: Bun for runtime performance, Vercept for perception and interaction.
The pattern signals clear intent: Anthropic is building a vertical computer-use stack through rapid talent consolidation rather than organic hiring. Two strategic acquisitions in 90 days for agentic infrastructure suggest a 6-12 month product timeline, not a research timeline. The market is reading this correctly: UiPath shares dropped 3.6% on the Vercept acquisition announcement alone, before any product integration has occurred.
Cost Economics: 10-50x Cheaper Than RPA
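Sonnet 4.6 achieves 94% insurance accuracy at $3/$15 per million input/output tokens, one-fifth the cost of Opus ($15/$75). Prompt caching provides up to 90% savings and batch processing 50% savings. Applying the 90% discount to both list rates puts the effective cost for repeated insurance workflows at approximately $0.30/$1.50 per million input/output tokens, although caching discounts apply to cached input tokens, so output-heavy workloads will save less.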
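A minimal sketch of the discount arithmetic, using only the list prices and savings percentages quoted above. The `discounted` helper is hypothetical, and the second scenario rests on an assumption not stated in the article: that the caching discount applies only to cached input tokens, that the batch discount applies to both input and output, and that the two can be combined.

```python
# List prices quoted in the text, in USD per million tokens.
SONNET_INPUT, SONNET_OUTPUT = 3.00, 15.00
CACHE_DISCOUNT = 0.90  # savings quoted for prompt caching
BATCH_DISCOUNT = 0.50  # savings quoted for batch processing


def discounted(price: float, *discounts: float) -> float:
    """Apply each fractional discount in turn to a per-million-token price."""
    for d in discounts:
        price *= 1.0 - d
    return round(price, 4)


# The article's figure: the 90% discount applied to both rates.
article_input = discounted(SONNET_INPUT, CACHE_DISCOUNT)    # 0.3
article_output = discounted(SONNET_OUTPUT, CACHE_DISCOUNT)  # 1.5

# Assumption: caching discounts cached input only, batch discounts both.
alt_input = discounted(SONNET_INPUT, CACHE_DISCOUNT, BATCH_DISCOUNT)  # 0.15
alt_output = discounted(SONNET_OUTPUT, BATCH_DISCOUNT)                # 7.5
```

Under the second reading, input gets cheaper than the article's figure while output stays well above it, so the blended rate depends heavily on the input/output mix of the workflow.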
Compare this to enterprise RPA licensing: UiPath charges $420/robot/month for attended automation and up to $1,680/robot/month for unattended. For high-volume insurance processing, the AI approach is already 10-50x cheaper than RPA, with higher accuracy on complex tasks. The exact multiple depends on task volume and tokens per task, since tokens are billed per use while RPA licenses are flat monthly fees. RPA's advantage is deterministic reliability: the same robot does the same thing every time. AI computer use is probabilistic, but the gap from 94% to production-grade reliability (99.9%+) is narrowing even as the cost advantage widens.
The 70% reduction in token consumption and 38% accuracy improvement on filesystem tasks (Sonnet 4.6 vs 4.5) show that efficiency gains are compounding alongside capability gains. Each model generation is simultaneously more capable and cheaper to run. This dual improvement curve is structurally different from RPA, where robots are static automations requiring manual maintenance when UIs change.
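Because tokens are priced per use and RPA robots per month, the comparison only becomes concrete once a workload is fixed. The sketch below uses the cached rates and license prices quoted in the text; the workload (2,000 tasks per month, 40,000 input and 2,000 output tokens per task) is purely hypothetical and chosen for illustration, as are the helper names.

```python
# Prices quoted in the text.
UIPATH_ATTENDED = 420.0     # USD per robot per month
UIPATH_UNATTENDED = 1680.0  # USD per robot per month
CACHED_INPUT, CACHED_OUTPUT = 0.30, 1.50  # USD per million tokens


def cost_per_task(input_tokens: int, output_tokens: int) -> float:
    """Token cost in USD for one task at the cached rates."""
    return (input_tokens * CACHED_INPUT + output_tokens * CACHED_OUTPUT) / 1_000_000


def break_even_tasks(license_per_month: float, per_task: float) -> float:
    """Monthly task count at which token spend equals one robot license."""
    return license_per_month / per_task


# Hypothetical workload, not taken from the article.
TASKS_PER_MONTH = 2_000
per_task = cost_per_task(input_tokens=40_000, output_tokens=2_000)
monthly = TASKS_PER_MONTH * per_task

print(round(per_task, 4))                                  # 0.015
print(round(monthly, 2))                                   # 30.0
print(round(UIPATH_ATTENDED / monthly, 1))                 # 14.0
print(round(UIPATH_UNATTENDED / monthly, 1))               # 56.0
print(round(break_even_tasks(UIPATH_ATTENDED, per_task)))  # 28000
```

With these assumed numbers the multiple lands near the article's 10-50x range, but it shrinks as volume per robot rises, and the costs meet at roughly 28,000 tasks per month against an attended license.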
[Figure: Claude Computer Use: The 16-Month Acceleration. Key milestones showing improvement from research curiosity to production-grade vertical automation capability. Source: Anthropic, MLQ.ai, TechCrunch, October 2024 through February 2026.]
World Models: Digital Computer Use to Physical Simulation
Google's Project Genie adds a dimension: the transition from digital computer use (operating existing software) to physical world generation (creating navigable 3D environments). If computer use = operating within existing digital interfaces, and world models = generating new interactive environments, then convergence creates AI agents that both operate existing software AND simulate new scenarios. Insurance claims adjusters, for example, could have an AI agent that processes the FNOL form (computer use) and generates a 3D reconstruction of the accident scene (world model) from photos.
The Contrarian View: Reliability Gap
94% accuracy sounds high but means a 6% error rate. In insurance, a 6% error rate on claims processing could mean thousands of incorrectly processed claims, regulatory violations, and litigation exposure. RPA's advantage is deterministic reliability: auditable, traceable, repeatable. AI computer use is probabilistic, and the gap from 94% to 99.9% (production-grade reliability for regulated industries) may be harder to close than the gap from 14.9% to 94%. Additionally, UiPath and other RPA vendors are integrating AI into their platforms, potentially capturing the upside without ceding the market.
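The scale of that gap is easier to see in terms of error rates. The sketch below assumes a hypothetical volume of 100,000 claims per month (not a figure from the article) and compares how far the error rate has to fall in each step. Note that the 14.9% and 94% figures come from different benchmarks, so the first ratio is only indicative.

```python
def expected_errors(volume: int, accuracy: float) -> int:
    """Expected number of incorrectly processed items at a given accuracy."""
    return round(volume * (1.0 - accuracy))


def error_reduction_factor(acc_from: float, acc_to: float) -> float:
    """How many times smaller the error rate must become."""
    return (1.0 - acc_from) / (1.0 - acc_to)


MONTHLY_CLAIMS = 100_000  # hypothetical volume for illustration

print(expected_errors(MONTHLY_CLAIMS, 0.94))            # 6000
print(expected_errors(MONTHLY_CLAIMS, 0.999))           # 100
print(round(error_reduction_factor(0.149, 0.94), 1))    # 14.2
print(round(error_reduction_factor(0.94, 0.999), 1))    # 60.0
```

Going from 94% to 99.9% requires cutting errors by about 60x, compared with about 14x for the move from 14.9% to 94%, which is one way to quantify why the last stretch may be the hardest.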
But at the current improvement trajectory, the question is not whether AI computer use replaces RPA, but when. The Vercept acquisition suggests Anthropic believes "when" is measured in quarters, not years.
What This Means for Practitioners
Enterprise teams currently using RPA for desktop automation should begin evaluating AI computer use as a supplement or replacement. For insurance, legal, and financial services workflows — high-volume, form-heavy, legacy UI — Claude Sonnet 4.6 at cached pricing is already more cost-effective than RPA licensing with comparable or superior accuracy on complex tasks.
Start with hybrid approaches: AI for complex unstructured tasks, RPA for simple deterministic ones. The insurance vertical is production-ready now (94% accuracy). General enterprise computer use: 3-6 months for early adopters in regulated industries. Broad RPA displacement: 12-18 months as accuracy approaches 99%+ and enterprise-grade reliability tooling matures.
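One way to picture the hybrid split is a simple routing rule. The sketch below is illustrative only: the `Task` fields, the `route` function, and the two criteria (fixed-schema input and a stable UI) are assumptions made for this example, not a description of any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    structured_input: bool  # fields arrive in a fixed, known schema
    stable_ui: bool         # target application's UI rarely changes


def route(task: Task) -> str:
    """Send simple deterministic work to RPA, everything else to the AI agent."""
    if task.structured_input and task.stable_ui:
        return "rpa"
    return "ai"


tasks = [
    Task("nightly policy export", structured_input=True, stable_ui=True),
    Task("FNOL intake from emailed photos", structured_input=False, stable_ui=True),
    Task("legacy claims portal update", structured_input=True, stable_ui=False),
]
for t in tasks:
    print(t.name, "->", route(t))
```

In practice a team would likely add a review step for AI-routed tasks in regulated workflows, since those outputs are probabilistic.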
Competitive implications are stark: Anthropic is building the strongest position in enterprise computer use via its acqui-hire strategy (Bun + Vercept in 3 months). UiPath and Automation Anywhere face existential pressure to integrate AI models deeply or lose market share. Google's world model capability (Project Genie) extends the threat from digital to physical world interaction. Companies building on UiPath for long-term automation should evaluate migration timelines now, not after market share has shifted.