Key Takeaways
- OpenAI's GPT-5.3-Codex-Spark runs on Cerebras WSE-3 at 1,000+ tokens/sec — 15x faster than H100 for interactive inference — in a $10B, 750-megawatt deal through 2028
- Qwen3.5-397B uses GatedDeltaNet linear attention (O(n) complexity) for 75% of transformer sublayers, delivering 19x inference speedup at 256K context vs. standard quadratic attention architectures
- Kimi K2.5's PARL (Parallel-Agent Reinforcement Learning) enables autonomous orchestration of 100 parallel sub-agents across 1,500 tool calls — 4.5x faster than single-agent execution, at 9x lower cost than Claude Opus 4.5
- AI infrastructure is fragmenting across three layers simultaneously: hardware (GPU → specialized silicon), architecture (quadratic → linear attention), and orchestration (single-agent → autonomous swarms)
- Production teams should evaluate all three dimensions — wrong hardware, architecture, or orchestration pattern each carries a 5-20x efficiency penalty
Three Simultaneous Disruptions
For five years, NVIDIA's GPU monopoly on AI inference was structurally stable. Training required H100s; inference ran on H100s; the developer ecosystem was built around CUDA. That assumption shattered in February 2026, not from a single breakthrough, but from three simultaneous disruptions hitting different layers of the AI infrastructure stack.
On February 13, 2026, OpenAI deployed GPT-5.3-Codex-Spark on Cerebras Systems' Wafer Scale Engine 3 — the first production AI model from a major lab running on non-NVIDIA silicon. Four days earlier, Alibaba had released Qwen3.5-397B, a frontier-class model using linear attention for most of its transformer layers, delivering 8-19x faster inference than standard attention architectures. And on January 27, Moonshot AI released Kimi K2.5, the first open-source model trained end-to-end to autonomously orchestrate 100 parallel sub-agents.
Each disruption targets a different layer of the AI inference stack. Together, they signal that the monolithic GPU monoculture assumption — all AI compute is fungible — is ending.
Disruption 1: Hardware — Cerebras WSE-3 at 1,000+ Tokens/Second
OpenAI's partnership with Cerebras is not a pilot: it's a $10B commitment for 750 megawatts of compute through 2028. The hardware difference is architectural. NVIDIA GPU clusters connect thousands of smaller chips via high-speed interconnects (NVLink, InfiniBand). Every token generation requires data movement between chips — and that movement latency compounds over hundreds of sequential forward passes in an interactive inference session.
Cerebras' WSE-3 is a single wafer-scale processor with all cores and memory on one massive die, eliminating inter-chip data movement entirely. The result: 1,000+ tokens per second for GPT-5.3-Codex-Spark, compared to approximately 70 tokens/second for the same model class on an H100. For interactive coding (where each user turn requires the model to generate a complete function or explanation), this is the difference between a 5-second wait and a 500-millisecond response: the psychological threshold between tool and collaborator.
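The latency gap follows directly from decode rate. A quick sanity check using the article's figures (a ~350-token completion is an assumption for illustration):

```python
# Back-of-envelope interactive latency for a ~350-token completion,
# using the decode rates cited above: ~1,000 tok/s on WSE-3 vs ~70 tok/s on H100.
def response_time(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to stream a completion at a given decode rate."""
    return tokens / tokens_per_sec

gpu_latency = response_time(350, 70)      # H100-class decode
wafer_latency = response_time(350, 1000)  # Cerebras WSE-3 decode

print(f"H100:  {gpu_latency:.1f}s")   # H100:  5.0s
print(f"WSE-3: {wafer_latency:.2f}s")  # WSE-3: 0.35s
```

The same completion that takes five seconds on a GPU cluster streams in about a third of a second on wafer-scale silicon, which is where the "tool vs. collaborator" threshold sits.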
OpenAI is explicit that Cerebras complements rather than replaces GPUs. Training and batch inference remain GPU-optimal. But the right hardware for interactive latency-sensitive inference is not the right hardware for training, and companies treating all AI compute as fungible will pay a structural latency tax on interactive workloads.
```python
# API optimization that shipped with Codex-Spark:
# the WebSocket path reduces per-roundtrip overhead by 80%.
import openai

client = openai.OpenAI()

# The WebSocket path is enabled by default for Codex-Spark
# and will become the default for all models soon.
stream = client.responses.create(
    model="gpt-5.3-codex-spark",
    input="Implement a binary search tree with in-order traversal",
    stream=True,  # WebSocket path, TTFT reduced 50%
)
for event in stream:
    print(event, end="", flush=True)
```
The API improvements that came with Codex-Spark — WebSocket persistent connections reducing per-roundtrip overhead 80%, time-to-first-token reduced 50% — will roll out to all models, making the infrastructure benefits broadly available regardless of hardware.
*Chart: AI inference throughput, specialized silicon vs. GPU. Tokens per second across inference hardware for interactive workloads. Source: OpenAI/Cerebras Feb 2026; Groq benchmarks.*
Disruption 2: Architecture — GatedDeltaNet Beats Quadratic Attention at Scale
Qwen3.5-397B, released February 16, 2026, is the first frontier-class model to use linear attention for the majority of its transformer sublayers at production scale. The architecture: a 60-layer stack structured as repeating blocks of 3× (GatedDeltaNet → MoE) → 1× (GatedAttention → MoE). Three of every four sublayers use Gated Delta Networks (GDN) — a state-based recurrence architecture delivering O(n) complexity — with full quadratic attention only in the fourth.
The practical consequence: at 256K context length, Qwen3.5 is 19x faster than Qwen3-Max (same capability tier, standard quadratic attention) and scales to 1M tokens. Standard attention materializes scores proportional to n²: at 1M tokens, that means 1 trillion attention weights per layer. The GDN layers instead run in O(n) time over a fixed-size recurrent state, so per-token memory is constant, the same at 1M tokens as at 128K.
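A minimal sketch of the delta-rule recurrence behind GatedDeltaNet makes the memory argument concrete. The dimensions, scalar gates, and plain Python loop below are illustrative toys, not Qwen3.5's actual implementation; the point is that the state `S` has a fixed size regardless of sequence length:

```python
import numpy as np

# Toy gated delta-rule recurrence: a fixed-size state S is updated
# once per token, so memory stays constant however long the sequence grows.
def delta_rule_scan(q, k, v, beta, alpha):
    """q, k, v: (seq_len, d) arrays; beta/alpha: per-token gates in (0, 1)."""
    seq_len, d = q.shape
    S = np.zeros((d, d))  # constant-size recurrent state: O(d^2), not O(n^2)
    out = np.empty_like(v)
    for t in range(seq_len):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-8)  # unit-norm key for stability
        # Gated decay, then delta-rule update: move the value stored
        # under key kt toward v[t], scaled by the write gate beta[t].
        S = alpha[t] * S + beta[t] * np.outer(v[t] - S @ kt, kt)
        out[t] = S @ q[t]  # read out with the query
    return out
```

Doubling the context doubles the loop length (O(n) compute) but leaves `S` untouched; quadratic attention would instead materialize an n×n score matrix.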
| Model | Max Context | Attention Type | GPQA Diamond | LiveCodeBench v6 |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 1M tokens | GDN hybrid (O(n)) | 88.4 | 83.6 |
| GPT-5.2 | 128K tokens | Standard (O(n²)) | ~85.0 | 87.7 |
| Claude Opus 4.5 | 200K tokens | Standard (O(n²)) | ~84.1 | 80.9 |
| Qwen3-Max (prior) | 128K tokens | Standard (O(n²)) | ~84.0 | ~80.0 |
Benchmark nuance: Qwen3.5 leads on GPQA Diamond (88.4 vs GPT-5.2's ~85) and instruction following (IFBench: 76.5, best in class), but trails GPT-5.2 on pure reasoning ceiling (LiveCodeBench: 83.6 vs 87.7; AIME26: 91.3 vs 96.7). The trade-off is deliberate: linear attention optimizes for inference economics and long-context reliability, not peak reasoning on short problems.
Self-hosting: 8×H100 sustains roughly 45 tokens/second, at ~$0.18 per 1M-token context query. The weights are released under Apache 2.0, so the model is commercially deployable.
```python
# Qwen3.5 via the Alibaba Cloud DashScope API (OpenAI-compatible)
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
response = client.chat.completions.create(
    model="qwen3.5-397b-instruct",
    messages=[{"role": "user", "content": "Analyze this 500-page contract..."}],
    max_tokens=8192,
)
```
Disruption 3: Orchestration — Kimi K2.5 PARL Trains Agent Swarms End-to-End
Kimi K2.5 (January 27, 2026) is the first model to learn autonomous agent orchestration through reinforcement learning, rather than requiring developers to define agent roles and handoffs explicitly. The methodology: PARL (Parallel-Agent Reinforcement Learning) trains an orchestrator agent to decompose tasks and spawn specialized sub-agents without predefined workflows. Sub-agents execute parallel workstreams across up to 1,500 coordinated tool calls.
The benchmark evidence is concrete: enabling Agent Swarm mode produces +18.4pp improvement on BrowseComp (complex web research requiring multi-source synthesis) and +6.3pp on WideSearch vs. single-agent mode on the same model. The 4.5x execution time reduction with 100 parallel agents demonstrates viable production throughput, not just benchmark cherry-picking.
| Model | BrowseComp | SWE-Bench Verified | HLE (with tools) | Cost ($/M input) |
|---|---|---|---|---|
| Kimi K2.5 (swarm mode) | 74.9% | 76.8% | 50.2% | $0.60 |
| GPT-5.2 (xhigh) | ~59.2% | ~80.0% | 45.5% | $5.00 |
| Claude Opus 4.5 | ~59.2% | 80.9% | ~45.0% | $15.00 |
```python
# Kimi K2.5 via Moonshot's OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_KEY",
    base_url="https://api.moonshot.ai/v1",
)

# Agent Swarm mode for complex multi-step tasks
response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {
            "role": "user",
            "content": "Research the top 3 investment opportunities "
                       "in each of 20 emerging AI hardware companies, "
                       "cross-referencing funding, benchmarks, and partnerships.",
        }
    ],
    extra_body={"mode": "agent_swarm"},  # spawns parallel sub-agents
)
```
At $0.60/M input tokens, a complex 100-agent swarm task consuming 5M tokens (100 agents × 50K each) costs $3. The same task on Claude Opus 4.5: $75. This cost structure fundamentally changes the economics of multi-agent application development.
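The arithmetic behind that comparison, as a quick sanity check using the per-token prices from the table above:

```python
# Cost of a 100-agent swarm task at the input prices cited above.
def task_cost(total_tokens_m: float, price_per_m_tokens: float) -> float:
    """Dollar cost for a task consuming total_tokens_m million input tokens."""
    return total_tokens_m * price_per_m_tokens

swarm_tokens_m = 100 * 50_000 / 1e6       # 100 agents x 50K tokens = 5M tokens
kimi = task_cost(swarm_tokens_m, 0.60)    # Kimi K2.5 at $0.60/M input
opus = task_cost(swarm_tokens_m, 15.00)   # Claude Opus 4.5 at $15.00/M input

print(f"Kimi K2.5: ${kimi:.2f}  Claude Opus 4.5: ${opus:.2f}")
# Kimi K2.5: $3.00  Claude Opus 4.5: $75.00
```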
The Three-Layer Infrastructure Fragmentation
These disruptions operate at different layers but converge on the same engineering challenge: the right tool for each layer depends on workload characteristics, not default assumptions.
- Hardware layer: Interactive inference (coding assistants, chat, real-time agents) → specialized silicon (Cerebras, Groq). Batch inference and training → GPUs. Treating them as fungible costs 5-15x in throughput.
- Architecture layer: Long-context workloads (>64K tokens) → linear attention models (Qwen3.5, future architectures). Short-context high-reasoning → quadratic attention models (GPT-5.2, Claude Opus). Wrong choice costs 10-20x in inference economics.
- Orchestration layer: Complex multi-step research, parallel information gathering → agent swarm (Kimi K2.5 PARL). Single-step reasoning, code generation → single-agent. Wrong choice costs 4.5x in execution time and missing +18pp capability.
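One way to operationalize the three decisions above is a small routing function. The thresholds and labels below are illustrative, mirroring the article's recommendations rather than any vendor API:

```python
# Hypothetical workload router across the three fragmented layers:
# hardware, architecture, and orchestration chosen from workload traits.
from dataclasses import dataclass

@dataclass
class Workload:
    interactive: bool        # is a user waiting on each response?
    context_tokens: int      # typical prompt/context size
    parallel_research: bool  # multi-source, multi-step information gathering?

def route(w: Workload) -> dict:
    return {
        "hardware": "specialized silicon (Cerebras/Groq)" if w.interactive else "GPU",
        "architecture": ("linear attention (Qwen3.5)" if w.context_tokens > 64_000
                         else "quadratic attention (GPT-5.2 / Opus)"),
        "orchestration": ("agent swarm (Kimi K2.5)" if w.parallel_research
                          else "single agent"),
    }

# A latency-critical, long-context coding assistant:
print(route(Workload(interactive=True, context_tokens=256_000,
                     parallel_research=False)))
```

Real deployments would weigh cost ceilings and reasoning requirements too; the point is that each decision is independent and each has a named default worth overriding.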
*Chart: Three-layer AI infrastructure fragmentation (Feb 2026). Key metrics showing simultaneous disruption across hardware, architecture, and orchestration layers. Source: OpenAI Feb 2026; Alibaba Feb 2026; Moonshot AI Jan 2026.*
What This Means for Practitioners
If you're building interactive developer tools: Evaluate Cerebras-backed inference endpoints for latency-critical paths. The 80% API overhead reduction that shipped with Codex-Spark is available for all models via WebSocket paths — enable this first.
If you're building long-context applications (document analysis, multi-session agents, whole-codebase reasoning): Qwen3.5-397B's linear attention delivers 19x speedup vs. quadratic alternatives at 256K context, for $0.18 per 1M-token query. This is the architecture to evaluate first for contexts above 64K tokens.
If you're building complex multi-step agent workflows: Benchmark Kimi K2.5 in Agent Swarm mode for tasks requiring parallel research or multi-source synthesis. The PARL-trained orchestrator outperforms manually defined agent frameworks on complex research tasks. At $0.60/M input, the cost barrier to multi-agent production deployment has dropped by an order of magnitude.
The fragmentation is happening now. The penalty for infrastructure inertia — continuing to default to GPU/quadratic attention/single-agent without re-evaluating — compounds as workloads scale.