
FlashAttention-4 Breaks the Petaflop Barrier: 1M Context Windows Move from Theory to Economics

FlashAttention-4 achieves 1,605 TFLOPS on NVIDIA's Blackwell B200 — 3.6x faster than FA2 at 32K tokens — directly changing the economics of long-context inference. Combined with Claude Opus 4.6's validated 76% accuracy at 1M tokens, this breaks the cost barrier for replacing RAG chunking with direct full-document inference in enterprise legal, financial, and compliance workflows.

TL;DR: Breakthrough 🟢
  • FlashAttention-4 achieves 1,605 TFLOPS on Blackwell B200 — 71% of theoretical maximum and 3.6x faster than FA2 at 32K-token passes
  • The speedup compounds at longer context lengths: attention scales O(n²), so 1M tokens require ~930x more computation than 32K tokens; FA4's optimizations amplify at scale
  • Claude Opus 4.6 achieves 76% MRCR v2 accuracy at 1M tokens (vs Sonnet 4.5: 18.5%) — the first production-validated long-context quality baseline
  • Direct full-document inference now economically replaces RAG for corpora up to ~750 pages, eliminating chunk engineering, embedding pipelines, and retrieval uncertainty
  • The efficiency gains stack orthogonally: FA4 (hardware kernel) + agent parallelism (Kimi K2.5's 4.5x speedup) + model sparsity (MoE) operate independently and compose
Tags: flashattention · inference-optimization · long-context · blackwell · gpu | 6 min read | Feb 24, 2026

From Theoretical to Economical: The Long Context Inflection Point

The economics of LLM inference follow a simple rule: longer context windows are quadratically more expensive, because attention computation scales as O(n²) with sequence length. A 1M-token sequence requires approximately (1,000,000 / 32,768)² ≈ 930x more attention computation per inference step than a 32K-token sequence.

This quadratic bottleneck made 1M context windows technically achievable but economically impractical for production workloads. FlashAttention-4 directly attacks this bottleneck.
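
The quadratic ratio above can be checked directly. A minimal sketch (it deliberately ignores the feed-forward cost, which scales only linearly in sequence length):

```python
def attention_cost_ratio(n_long: int, n_short: int) -> float:
    """Relative attention compute under pure O(n^2) scaling
    in sequence length."""
    return (n_long / n_short) ** 2

# 1M tokens vs a 32K-token pass: roughly 930x more attention FLOPs
ratio = attention_cost_ratio(1_000_000, 32_768)
print(f"{ratio:.0f}x")  # prints 931x
```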

The FA4 Architecture Breakthrough: 1,605 TFLOPS

Presented at Hot Chips 2025 by Tri Dao (Chief Scientist, TogetherAI), FlashAttention-4 is the first attention kernel to breach the petaflop barrier on a single GPU — achieving 1,605 TFLOPS on NVIDIA's Blackwell B200, which has a theoretical maximum of 2,250 TFLOPS.

This represents:

  • 71% hardware utilization: Approaching practical maximum efficiency for memory-bound workloads
  • 3.6x speedup over FA2: At 32K-token forward passes (the standard benchmark range)
  • 22% faster than NVIDIA cuDNN: Outperforming NVIDIA's own reference attention implementation
  • 2x speedup over FA3: On Hopper H100 architecture

Four architectural innovations enable this leap:

  1. 5-stage warp-specialized pipeline (vs FA3's 2-stage ping-pong): Each warp specializes in a specific computational role (tile input loading, GEMM computation, softmax, tile output writing), enabling deeper pipelining and higher hardware utilization
  2. TMEM (Tensor Memory) utilization: Blackwell's 256KB per-SM dedicated tensor memory eliminates shared memory bandwidth bottlenecks that plagued prior architectures
  3. Online softmax with 90% rescaling reduction: Exponential computation approximated via cubic polynomial on CUDA cores (MUFU.EX2), enabling overlap of softmax with tensor core matrix multiplications
  4. 5th-generation tensor core instructions (SM10.0 PTX, tcgen05.mma): Blackwell-exclusive instructions unavailable on prior architectures
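
Innovation 3 builds on the online-softmax recurrence that every FlashAttention generation shares: the softmax normalizer is accumulated tile by tile, rescaling previously accumulated output whenever the running max changes. (FA4's contribution is doing ~90% less of that rescaling and approximating the exponential on CUDA cores.) A minimal NumPy sketch of the base recurrence — not FA4's fused kernel — for a single attention row:

```python
import numpy as np

def online_softmax_attention(score_tiles, value_tiles):
    """One attention output row computed tile by tile with online softmax.

    Maintains a running max `m`, normalizer `l`, and unnormalized
    accumulator `acc`; each new tile rescales the previous state by
    exp(m_old - m_new), so the full score row never needs to be
    materialized in memory at once.
    """
    m = -np.inf
    l = 0.0
    acc = np.zeros(value_tiles[0].shape[1])
    for s, v in zip(score_tiles, value_tiles):
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)   # rescale previously accumulated state
        p = np.exp(s - m_new)       # unnormalized tile probabilities
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l
```

The result matches an ordinary full-row softmax exactly; the tiling only changes the order of computation, which is what lets the kernel keep everything on-chip.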

FlashAttention-4 is already production-deployed: SGLang and vLLM integrate it for prefill operations; NVIDIA incorporated FA4 techniques into cuDNN 9.14, extending speedups to all cuDNN-using frameworks without custom kernel work.

[Chart: FlashAttention generation-over-generation peak throughput. Peak TFLOPS for each FlashAttention version on its target NVIDIA GPU architecture. Source: Tri Dao / TogetherAI, Hot Chips 2025 + historical benchmarks]

The Compounding Effect at 1M Tokens: Attention Dominance

The 3.6x benchmark is measured at 32K tokens. At 1M tokens, attention computation dominates inference cost, and the relative savings amplify. FA4's memory bandwidth optimizations (TMEM usage, warp-specialized tiling efficiency, reduced rescaling work) reduce the memory-bound bottleneck that compounds at extreme sequence lengths. Real-world speedups at 1M token ranges are expected to exceed the 32K headline figure.

This directly reshapes the economics of Claude Opus 4.6's extended context offering. Standard pricing: $5 input / $25 output per million tokens. Extended context (>200K tokens): $10 input / $37.50 output — a 2x input premium. Claude Opus 4.6 achieves 76% MRCR v2 accuracy at 1M tokens versus Sonnet 4.5's 18.5%, providing the first independently validated quality baseline for production long-context use.

As FA4-enabled Blackwell clusters reduce compute costs for long-context inference, the extended context premium becomes commercially sustainable at increasing volumes. The adoption threshold shifts from "Can we afford 1M context?" to "When should we replace RAG with direct document inference?"
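
Using the list prices quoted above, the per-call economics can be sketched. The rates are the article's figures; the token counts are illustrative assumptions, not benchmarks:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float, out_rate: float) -> float:
    """Cost in dollars at per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Opus 4.6 rates from the text: standard $5/$25, extended (>200K) $10/$37.50
standard = call_cost(150_000, 4_000, 5.00, 25.00)
extended = call_cost(900_000, 4_000, 10.00, 37.50)
print(f"150K-token call: ${standard:.2f}")  # prints $0.85
print(f"900K-token call: ${extended:.2f}")  # prints $9.15
```

The ~11x gap per call is the number a RAG pipeline's engineering and retrieval-quality costs must be weighed against.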

RAG Replacement Economics: Direct Full-Document Inference

Previously, 1M-token workloads required Retrieval-Augmented Generation (RAG): split documents into chunks, embed them separately, retrieve top-K chunks via similarity search, then assemble a context window from fragments. This architecture:

  • Loses cross-document signals: Chunks are scored independently; relationships between documents are lost
  • Requires embedding pipeline maintenance: Update cycles, version management, evaluation of embedding quality
  • Introduces retrieval quality uncertainty: Top-K retrieval often misses relevant context; no end-to-end quality signal

With FA4 reducing 1M-context inference costs, direct full-document inference becomes economically viable for corpora up to approximately 750 pages. For enterprise legal, financial, and compliance workflows — where document relationships and full context matter — this eliminates an entire engineering layer while capturing richer inter-document connections.
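
The ~750-page ceiling follows from a rough tokens-per-page estimate. Assuming ~1,300 tokens per dense page (an assumption consistent with the figure above, not a measured value):

```python
TOKENS_PER_PAGE = 1_300  # rough assumption for dense, single-spaced prose

def max_pages(context_tokens: int, reserve_tokens: int = 25_000) -> int:
    """Pages that fit in a context window, reserving room for the
    prompt scaffold and the model's answer."""
    return (context_tokens - reserve_tokens) // TOKENS_PER_PAGE

print(max_pages(1_000_000))  # prints 750
```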

The practical implication: ML engineers building document-heavy enterprise applications should reevaluate RAG architectures. Direct inference on Blackwell with FA4 + Claude Opus 4.6 may be cheaper and higher-quality than maintaining a RAG pipeline.

The 2026 Inference Stack: Cloud and Edge

FA4 defines one pole; ExecuTorch 1.0 GA (Meta, October 2025) defines the other. Together, they establish the upper and lower bounds of the 2026 AI inference efficiency stack:

  • Edge (ExecuTorch): Sub-20ms latency, 50KB runtime, quantized 4-8B models, fully on-device, privacy-first
  • Cloud (FA4 + Blackwell): 1,605 TFLOPS, 32K-1M token context, frontier model quality

This is not competition — it reflects that AI deployment spans a spectrum from smartphone assistants (ExecuTorch + Phi-3) to enterprise legal document analysis (FA4 + Claude 1M context). Each optimization layer serves a distinct use case at a distinct cost point.

Orthogonal Efficiency Gains: Hardware, Software, and Architecture

Kimi K2.5's 4.5x task speedup on complex research tasks comes from parallelizing across 100 specialized sub-agents (PARL architecture). This is a software-level approach to throughput optimization. FlashAttention-4 addresses the same challenge at the hardware kernel level. The optimizations are orthogonal and composable: an agent swarm executing on FA4-enabled Blackwell infrastructure can achieve compound throughput gains exceeding either approach independently.

The 2026 efficiency gains operate simultaneously at three orthogonal layers:

  • Hardware kernel optimization (FA4): 3.6x+ speedup via IO-aware attention compute
  • Model architecture (MoE sparse activation): Conditional computation reducing FLOP count by 2-4x
  • Orchestration (agent parallelism): 4.5x speedup via sub-agent swarms

These stack multiplicatively. A 100-agent PARL swarm executing sparse MoE models on FA4-enabled Blackwell infrastructure can achieve combined gains well over an order of magnitude beyond any single layer alone.
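
Treating the three layers as independent multipliers gives a rough upper bound (an idealization — real workloads rarely compose perfectly, since each layer shifts where the bottleneck sits):

```python
def compound_speedup(layers: dict[str, float]) -> float:
    """Product of per-layer speedups, assuming full independence."""
    total = 1.0
    for factor in layers.values():
        total *= factor
    return total

stack = {
    "FA4 kernel": 3.6,         # attention speedup vs FA2 at 32K
    "MoE sparsity": 3.0,       # midpoint of the 2-4x FLOP reduction
    "agent parallelism": 4.5,  # Kimi K2.5 PARL task speedup
}
print(f"{compound_speedup(stack):.0f}x")  # prints 49x
```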

Current Limitations: Forward Pass Only, Blackwell Hardware Requirement

FA4's primary constraints:

  • Inference-only support: Forward propagation is optimized; backward pass still requires FA3 on Hopper. Training and fine-tuning workflows cannot yet leverage FA4's speedups
  • Unrealized optimizations: FA4 does not yet utilize Blackwell's FP4 operations or 2-CTA matrix multiplications, leaving performance gains in reserve for future versions
  • Hardware coupling: Blackwell-only compatibility (SM10.0 PTX) means A100/H100 infrastructure sees no benefit without hardware upgrades

These are not showstoppers for inference-heavy enterprise workloads, where forward-pass optimization is the primary value driver.

Adoption Headwinds: The Long-Context Adoption Problem

The 1M context adoption thesis has historical headwinds: Google Gemini 1.5 introduced 1M context in early 2024 with limited enterprise uptake. Market resistance is not purely economic — it's also workflow inertia and the difficulty of evaluating retrieval quality for long-context inference.

FA4 reduces the cost barrier; it does not resolve integration and evaluation challenges. Long-context adoption requires:

  • Retraining evaluation pipelines for full-document comprehension
  • Changing procurement workflows from "Does this RAG vendor have good retrievers?" to "Can this model ingest full documents accurately?"
  • Addressing regulatory questions: if a model reads the entire contract, are there explainability requirements on which passages influenced the final answer?

The counterargument: Claude Opus 4.6's independently validated 76% MRCR v2 at 1M tokens is qualitatively different from prior long-context claims that degraded at scale. Cost and quality improvements together may reach the adoption threshold that cost alone could not.

What This Means for Practitioners

For ML engineers building document-heavy enterprise applications (legal, financial, compliance), the practical recommendation is immediate: reevaluate your RAG architecture.

Timeline and Action Items:

  • 0-3 months (now): FA4 benefits are available now via SGLang, vLLM, or cuDNN 9.14 on Blackwell B200. If your cloud provider (AWS, Azure, GCP) has Blackwell allocations, direct inference experiments should begin immediately
  • 6-12 months: Blackwell availability expands; FA4 becomes the default inference backend on new deployments. Expect cost parity or better for 1M-context workloads versus current 128K-context RAG pipelines
  • 12-18 months: Long-context RAG replacement architectures mature for regulated industries requiring full-document traceability. Procurement standards shift toward native long-context support
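
Before committing to the 0-3 month experiments, it is worth gating on hardware: FA4's tcgen05 instructions require SM 10.x (Blackwell). A hedged sketch of the check — the guarded torch call is the standard way to read compute capability, and the threshold reflects the SM10.0 requirement stated earlier:

```python
def supports_fa4(major: int, minor: int) -> bool:
    """True if the GPU's compute capability is Blackwell-class
    (SM 10.x), which FA4's tcgen05.mma instructions require."""
    return major >= 10

def local_gpu_supports_fa4() -> bool:
    """Check the first visible CUDA device, if torch and a GPU exist."""
    try:
        import torch
        if not torch.cuda.is_available():
            return False
        return supports_fa4(*torch.cuda.get_device_capability(0))
    except ImportError:
        return False

print(supports_fa4(10, 0))  # B200 -> True
print(supports_fa4(9, 0))   # H100 -> False: fall back to FA3
```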

For competitive positioning: First-mover advantage exists for vendors shipping FA4-optimized long-context offerings. Anthropic (1M Opus 4.6 quality + FA4 economics = first credible enterprise 1M offering), TogetherAI (FA4 originator with deployment advantage), and cloud providers with early Blackwell allocation (AWS, Azure, GCP) can establish long-context market position before RAG-specialized vendors adapt.

Disadvantaged: RAG-specialized vendors whose differentiation erodes as direct long-context inference costs fall, and H100-heavy infrastructure providers unable to access FA4 gains without hardware upgrades.
