Key Takeaways
- Saining Xie created DiT (Diffusion Transformer), now powering OpenAI's Sora and LTX-2.3 production video — then joined AMI Labs as Chief Scientist to build JEPA, an architecture explicitly designed to replace Transformers. His own innovation commercialized inside the system he is trying to replace.
- NVIDIA's Nemotron 3 Super absorbed Mamba-2 (proposed as O(n) alternative to O(n²) attention) as a component layer — demonstrating that alternative architectures contribute their best ideas to hybrid Transformer systems rather than building competing ecosystems.
- The roughly 56:1 capital ratio ($112B deployed in the Transformer ecosystem vs $2B in world models) means the Transformer ecosystem has 5+ years of compounding advantage (tooling, fine-tunes, developer knowledge, training infrastructure) before JEPA reaches commercial viability.
- JEPA may resist absorption if its training objective (representation-space prediction) is architecturally incompatible with Transformer integration — you cannot add a 'JEPA layer' the way you add a Mamba layer. This is the key differentiator to watch.
- Default recommendation for production deployments: Transformer-based now, hybrid Mamba+Transformer for long-context agentic workloads, JEPA as 2027-2028 research signal.
The DiT Irony: Innovation Inside the System You're Trying to Replace
Saining Xie, now Chief Scientist at AMI Labs ($1.03B JEPA startup), created the Diffusion Transformer (DiT) architecture during his time at NYU/Meta. DiT became the architectural backbone for OpenAI's Sora and, as of March 2026, Lightricks' LTX-2.3 — the open-source model generating 4K/50fps video on consumer hardware at 18x the speed of prior open-source SOTA.
LTX-2.3 is a 22B parameter DiT that is commercially production-grade, Apache 2.0 licensed, and runs on an RTX 3080. It represents successful commercialization of Xie's architectural innovation. But it commercialized within the Transformer ecosystem, not as an alternative to it — DiT is a Diffusion Transformer, using attention mechanisms adapted for diffusion-based generation rather than autoregressive prediction.
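To make the distinction concrete, here is a minimal numpy sketch of the attention step in a DiT-style block. This is an illustrative toy, not LTX-2.3's or Sora's actual implementation: it is single-head, omits projections, and stands in for DiT's adaLN timestep conditioning with a simple additive embedding. The point is structural — every patch attends to every other patch with no causal mask, unlike the masked attention of autoregressive Transformers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dit_attention(patch_tokens, t_embed):
    # Timestep conditioning: real DiT uses adaLN modulation; a simple
    # additive embedding stands in for it here.
    h = patch_tokens + t_embed
    scores = h @ h.T / np.sqrt(h.shape[-1])
    # No causal mask: every patch attends to every patch, unlike the
    # masked attention of autoregressive language Transformers.
    return softmax(scores, axis=-1) @ h

rng = np.random.default_rng(0)
patches = rng.normal(size=(64, 32))   # 64 image-latent patches, width 32
t_embed = rng.normal(size=(32,))      # toy diffusion-timestep embedding
out = dit_attention(patches, t_embed)
print(out.shape)                      # (64, 32)
```

The machinery is still attention over tokens — which is exactly why DiT slotted so cleanly into the existing Transformer tooling and ecosystem.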
Now Xie is at AMI Labs building JEPA — an architecture explicitly designed to move beyond Transformers. The question is whether JEPA will follow the same trajectory: producing important architectural innovations that the Transformer ecosystem absorbs and deploys at scale before JEPA can establish its own commercial ecosystem. History suggests yes. Xie's own track record suggests the absorption pattern is structural, not incidental.
The Mamba Absorption: From Transformer Alternative to Transformer Component
NVIDIA's Nemotron 3 Super provides the clearest evidence of the absorption pattern. Mamba-2 (state space models with linear sequence complexity) was proposed as an alternative to attention-based Transformers for long-sequence processing — the argument being that O(n) complexity would eventually replace O(n²) attention as context windows grew.
Instead of replacement, absorption occurred: Nemotron 3 Super uses Mamba-2 layers for the majority of sequence processing (cheap, linear-time context handling) while retaining selective Attention layers for high-precision reasoning steps. The result — 91.75% RULER at 1M tokens vs GPT-OSS's 22.30% — demonstrates that Mamba's core innovation (linear-time sequence processing) is more valuable as a component of a hybrid Transformer system than as a standalone alternative. The LatentMoE innovation adds Mixture-of-Experts routing (another non-Transformer concept) on top — creating a Frankenstein of Mamba + Attention + MoE where the Transformer framework provides the integration platform for all three innovations.
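A back-of-the-envelope cost model shows why the hybrid wins. The layer counts, dimensions, and interleaving pattern below are hypothetical (not Nemotron's actual configuration); the sketch only illustrates how replacing most O(n²) attention layers with O(n) state-space layers collapses compute at long context.

```python
# Illustrative cost model for a hybrid Mamba + attention stack.
# Layer counts and dimensions are hypothetical, not Nemotron's config.

def layer_cost(seq_len, d_model, kind):
    """Approximate per-layer compute: O(n) for an SSM scan, O(n^2) for attention."""
    if kind == "mamba":
        return seq_len * d_model          # linear-time state-space scan
    if kind == "attention":
        return seq_len ** 2 * d_model     # quadratic pairwise attention
    raise ValueError(kind)

def stack_cost(seq_len, d_model, pattern):
    return sum(layer_cost(seq_len, d_model, k) for k in pattern)

# Two hypothetical 48-layer stacks evaluated at a 1M-token context.
n, d = 1_000_000, 4096
pure_attention = ["attention"] * 48
hybrid = (["mamba"] * 5 + ["attention"]) * 8   # 40 Mamba + 8 attention layers

ratio = stack_cost(n, d, hybrid) / stack_cost(n, d, pure_attention)
print(f"hybrid / pure-attention compute: {ratio:.4f}")
```

Keeping even 8 of 48 layers as attention leaves the quadratic term dominant, so the hybrid's cost is roughly 8/48 of the pure stack at this length — the Mamba layers' contribution is nearly free. That economics is precisely what makes Mamba more valuable as a component than as a standalone competitor.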
The Ecosystem Gravity Engine
Four structural forces drive the absorption pattern:
Tooling lock-in: Every ML framework (PyTorch, JAX), inference engine (vLLM, TensorRT), and deployment platform (HuggingFace, NVIDIA NIM) is optimized for Transformer-based architectures. Building equivalent tooling for JEPA requires years of engineering investment — independent of JEPA's technical merit.
Training infrastructure: Frontier training runs ($10M-$100M) are optimized for attention-based architectures. Nemotron was pretrained on a 25-trillion-token corpus using infrastructure specifically designed for Transformer workloads. JEPA needs equivalent investment from scratch.
Developer knowledge: Millions of ML engineers understand Transformers, attention mechanisms, and fine-tuning patterns. Alternative architectures must be dramatically better — not just marginally better — to justify retraining the entire developer base.
Fine-tuning ecosystem compounding: Transformer-based models have thousands of task-specific fine-tunes on HuggingFace. The value of a foundation model compounds with its ecosystem. Alternative architectures start from zero and cannot leverage any existing fine-tune.
The Capital Asymmetry and JEPA's Only Path
The Q1 2026 capital deployment is stark: over $112B within the Transformer ecosystem (OpenAI $110B, Apple-Google $1B, Lightricks' LTX-2.3 DiT deployment) versus $2B for paradigm replacement (AMI Labs + World Labs). The 56:1 ratio means the Transformer ecosystem has overwhelming resources to absorb any innovation that alternative architectures produce, and to extend its competitive lead through scaling.
AMI Labs targets commercial applications in 1-2 years and universal systems in 3-5 years. In those 3-5 years, OpenAI will deploy $110B of capital extending Transformer dominance. NVIDIA will release two or three more hybrid model generations, absorbing any useful innovations from JEPA research. The architectural insurgent must not only be better but dramatically better.
JEPA's only structural path to independent success is if its training objective — predicting in abstract representation space — is architecturally incompatible with Transformer integration. You cannot add a 'JEPA layer' to a Transformer the way you add a Mamba layer. If JEPA requires a fundamentally different training paradigm, the absorption pattern may not apply. The historical analogy would then be CNNs eventually giving way to Transformers — a paradigm that required wholesale replacement, not incremental absorption. VL-JEPA's 1.6B parameter efficiency advantage at small scale is the first empirical signal to watch for whether JEPA's scaling laws validate this hypothesis.
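The shape of that incompatibility can be sketched in a few lines of numpy. This is a deliberately minimal caricature of a JEPA-style objective — random linear maps standing in for encoders and predictor, no masking scheme, no EMA updates — not Meta's or AMI Labs' implementation. What it shows is that the loss is defined between embeddings produced by two coupled encoders, which makes JEPA a training paradigm rather than a layer you can interleave into an existing Transformer stack.

```python
# Minimal caricature of a JEPA-style objective: predict the target's
# *embedding*, not its raw pixels or tokens. All maps are random and
# purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep = 16, 8

W_ctx = rng.normal(size=(d_in, d_rep))    # context encoder (trainable)
W_tgt = W_ctx.copy()                      # target encoder (EMA copy in practice)
W_pred = rng.normal(size=(d_rep, d_rep))  # predictor head

x_context = rng.normal(size=(d_in,))      # visible part of the input
x_target = rng.normal(size=(d_in,))       # masked-out part of the input

z_pred = (x_context @ W_ctx) @ W_pred     # predicted target representation
z_tgt = x_target @ W_tgt                  # actual target representation
                                          # (no gradient flows here in practice)

# The loss lives in representation space, not output space.
loss = float(np.mean((z_pred - z_tgt) ** 2))
```

Contrast this with a Transformer's token-space cross-entropy, or a Mamba layer, which is a drop-in sequence-mixing module under the same next-token loss: here the objective itself is different, so absorbing it would mean changing how the whole model is trained.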
Architecture Innovation-to-Absorption Cycle: DiT and Mamba Follow the Same Pattern
Alternative architecture innovations reach commercial deployment inside the Transformer ecosystem faster than they can build independent alternatives:
- 2022: LeCun proposes the Joint Embedding Predictive Architecture (JEPA) as a post-Transformer paradigm
- 2023: Diffusion Transformer (DiT) architecture designed for visual generation
- OpenAI deploys Xie's DiT architecture in its proprietary Transformer ecosystem (Sora)
- State space model (Mamba) proposed as an O(n) alternative to O(n²) attention
- 2026: DiT reaches production-grade open-source deployment in the Transformer ecosystem (LTX-2.3)
- NVIDIA's hybrid Nemotron 3 Super uses Mamba-2 as a component layer within the Transformer framework
- 2026: Commercial JEPA effort (AMI Labs) begins, four years after the paper and still pre-product

Source: arXiv publication dates, product launch announcements
What This Means for Practitioners
For production architecture selection: Default to Transformer-based systems (GPT-5.4, Gemini, Claude) for the next 2-3 years. Adopt hybrid Mamba+Transformer architectures (Nemotron 3 Super) for agentic workloads requiring long context — the 91.75% RULER at 1M tokens advantage is real and accessible now.
For research-stage evaluation: Monitor VL-JEPA scaling results from Meta FAIR as the key empirical test of JEPA's architecture-level differentiation. If VL-JEPA performance holds at 7B parameters with proportional efficiency advantages, it becomes a legitimate 2027-2028 production consideration. If performance plateaus or scaling laws prove weaker than Transformer equivalents, the absorption trajectory is confirmed.
For investment and roadmap planning: JEPA and world model architectures carry genuine 3-5 year uncertainty. Do not build irreversible infrastructure dependencies on AMI Labs' timeline. The Transformer ecosystem's absorption engine is fast — DiT went from paper (2023) to production-grade open-source deployment (LTX-2.3, 2026) in 3 years. JEPA has been a paper since 2022 and is still pre-product. The clock is running on both sides simultaneously.