
The Transformer Black Hole: Alternative AI Architecture Innovations Are Absorbed Faster Than They Can Build Competing Ecosystems

DiT creator Saining Xie now builds JEPA to replace Transformers, but DiT was absorbed into Sora and LTX-2.3. Mamba-2 became a Nemotron component. The pattern is structural.

TL;DR
  • Saining Xie created DiT (Diffusion Transformer), now powering OpenAI's Sora and LTX-2.3 production video — then co-founded AMI Labs to build JEPA architecture that explicitly replaces Transformers. His own innovation commercialized inside the system he is trying to replace.
  • NVIDIA's Nemotron 3 Super absorbed Mamba-2 (proposed as O(n) alternative to O(n²) attention) as a component layer — demonstrating that alternative architectures contribute their best ideas to hybrid Transformer systems rather than building competing ecosystems.
  • The 55:1 capital ratio ($110B Transformer vs $2B world models) means the Transformer ecosystem has 5+ years of compounding ecosystem advantage (tooling, fine-tunes, developer knowledge, training infrastructure) before JEPA reaches commercial viability.
  • JEPA may resist absorption if its training objective (representation-space prediction) is architecturally incompatible with Transformer integration — you cannot add a 'JEPA layer' the way you add a Mamba layer. This is the key differentiator to watch.
  • Default recommendation for production deployments: Transformer-based now, hybrid Mamba+Transformer for long-context agentic workloads, JEPA as 2027-2028 research signal.
Transformer · JEPA · DiT · Mamba · AMI Labs · 5 min read · Mar 23, 2026
Impact: Medium · Horizon: Long-term

ML engineers should default to Transformer-based architectures for production deployments and adopt hybrid approaches (Mamba layers for long context, MoE for efficiency) as they become available in standard frameworks. JEPA and world model architectures are research-stage investments with 3-5 year horizons — monitor VL-JEPA scaling results but do not plan production architectures around them yet.

Adoption: Hybrid Transformer architectures (Mamba + Attention + MoE) available now via Nemotron 3 Super. DiT-based video generation available now via LTX-2.3. JEPA-based production systems: earliest 2027, likely 2028-2029 for general availability.

Cross-Domain Connections

  • Saining Xie created DiT (Diffusion Transformer) architecture, now deployed in LTX-2.3 (production-grade 4K video, Apache 2.0)
  • Xie co-founded AMI Labs ($1.03B) to build JEPA architecture that explicitly replaces Transformers

The same researcher's previous innovation (DiT) was absorbed into the Transformer ecosystem and commercialized by others. His new innovation (JEPA) faces the same absorption risk — but JEPA's training objective (representation-space prediction) may be architecturally incompatible with incremental integration, which is the key differentiator.

  • Nemotron 3 Super absorbs Mamba-2 (SSM), MoE routing, and selective Attention into a single 120B/12B hybrid model
  • OpenAI's $110B raise doubles down on pure Transformer scaling with GPT-5.4

The Transformer ecosystem has two absorption strategies: NVIDIA integrates alternative innovations (Mamba) as components; OpenAI outscales them with pure Transformer brute force ($110B). Alternative architectures must survive both simultaneous attacks — co-option of their innovations AND competitive scaling from the incumbent.

  • LTX-2.3 DiT model achieves production-grade 4K video generation on consumer hardware
  • AMI Labs targets 1-2 years for initial JEPA commercial applications, 3-5 years for universal systems

DiT (a Transformer variant) reached full commercial deployment in ~3 years (2023 paper to 2026 LTX-2.3). JEPA, published in 2022, is still pre-product, with commercial systems projected for 2027-2029. The Transformer ecosystem's absorption cycle is faster than the alternative paradigm's commercialization cycle — unless JEPA's advantage is so large that it justifies a clean break.


The DiT Irony: Innovation Inside the System You're Trying to Replace

Saining Xie, now Chief Scientist at AMI Labs ($1.03B JEPA startup), created the Diffusion Transformer (DiT) architecture during his time at NYU/Meta. DiT became the architectural backbone for OpenAI's Sora and, as of March 2026, Lightricks' LTX-2.3 — the open-source model generating 4K/50fps video on consumer hardware at 18x the speed of prior open-source SOTA.

LTX-2.3 is a 22B parameter DiT that is commercially production-grade, Apache 2.0 licensed, and runs on an RTX 3080. It represents successful commercialization of Xie's architectural innovation. But it commercialized within the Transformer ecosystem, not as an alternative to it — DiT is a Diffusion Transformer, using attention mechanisms adapted for diffusion-based generation rather than autoregressive prediction.
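That distinction — attention adapted for joint denoising rather than autoregressive prediction — can be made concrete with a toy numpy sketch. Nothing below is LTX-2.3's or Sora's actual code; the function, sizes, and random scores are invented for illustration. An autoregressive Transformer applies a causal mask so each token sees only its predecessors, while a diffusion Transformer denoises all patch tokens jointly, so every token attends to every other:

```python
import numpy as np

def attention_weights(n_tokens, causal, seed=0):
    """Toy single-head attention: random scores, optional causal mask, row softmax."""
    rng = np.random.default_rng(seed)
    scores = rng.standard_normal((n_tokens, n_tokens))
    if causal:
        # Autoregressive setting: token i may only attend to tokens j <= i.
        mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

ar  = attention_weights(6, causal=True)    # GPT-style: lower-triangular support
dit = attention_weights(6, causal=False)   # DiT-style: all patches see all patches

print(np.count_nonzero(ar), np.count_nonzero(dit))  # 21 36
```

Same attention machinery, different masking and training target — which is why DiT slotted so easily into the existing Transformer toolchain.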

Now Xie is at AMI Labs building JEPA — an architecture explicitly designed to move beyond Transformers. The question is whether JEPA will follow the same trajectory: producing important architectural innovations that the Transformer ecosystem absorbs and deploys at scale before JEPA can establish its own commercial ecosystem. History suggests yes. Xie's own track record suggests the absorption pattern is structural, not incidental.

The Mamba Absorption: From Transformer Alternative to Transformer Component

NVIDIA's Nemotron 3 Super provides the clearest evidence of the absorption pattern. Mamba-2 (state space models with linear sequence complexity) was proposed as an alternative to attention-based Transformers for long-sequence processing — the argument being that O(n) complexity would eventually replace O(n²) attention as context windows grew.

Instead of replacement, absorption occurred: Nemotron 3 Super uses Mamba-2 layers for the majority of sequence processing (cheap, linear-time context handling) while retaining selective Attention layers for high-precision reasoning steps. The result — 91.75% RULER at 1M tokens vs GPT-OSS's 22.30% — demonstrates that Mamba's core innovation (linear-time sequence processing) is more valuable as a component of a hybrid Transformer system than as a standalone alternative. The LatentMoE innovation adds Mixture-of-Experts routing (another non-Transformer concept) on top — creating a Frankenstein of Mamba + Attention + MoE where the Transformer framework provides the integration platform for all three innovations.
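A back-of-the-envelope cost model shows why the hybrid is attractive at long context. The layer counts and the 7:1 Mamba-to-Attention ratio below are illustrative assumptions, not Nemotron 3 Super's published configuration, and the per-layer constants are deliberately crude:

```python
def layer_cost(kind, n):
    """Rough per-layer cost in token-interaction units (constants invented):
    an SSM scan is O(n), full attention is O(n^2)."""
    return n if kind == "mamba" else n * n

def stack_cost(layers, n):
    """Total cost of a stack of layers over an n-token sequence."""
    return sum(layer_cost(kind, n) for kind in layers)

n = 1_000_000                                      # 1M-token context
pure_attn = ["attention"] * 48                     # hypothetical all-attention stack
hybrid    = (["mamba"] * 7 + ["attention"]) * 6    # 1 attention layer per 8, same depth

print(stack_cost(pure_attn, n) / stack_cost(hybrid, n))  # ~8.0
```

Under this model the surviving attention layers still dominate the hybrid's cost, but replacing 7 of every 8 with linear-time layers cuts the total roughly 8x — an efficiency win captured entirely inside the Transformer framework.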

The Ecosystem Gravity Engine

Four structural forces drive the absorption pattern:

Tooling lock-in: Every ML framework (PyTorch, JAX), inference engine (vLLM, TensorRT), and deployment platform (HuggingFace, NVIDIA NIM) is optimized for Transformer-based architectures. Building equivalent tooling for JEPA requires years of engineering investment — independent of JEPA's technical merit.

Training infrastructure: Frontier training runs ($10M-$100M) are optimized for attention-based architectures. Nemotron's 25 trillion token pretraining corpus was trained on infrastructure specifically designed for Transformer workloads. JEPA needs equivalent investment from scratch.

Developer knowledge: Millions of ML engineers understand Transformers, attention mechanisms, and fine-tuning patterns. Alternative architectures must be dramatically better — not just marginally better — to justify retraining the entire developer base.

Fine-tuning ecosystem compounding: Transformer-based models have thousands of task-specific fine-tunes on HuggingFace. The value of a foundation model compounds with its ecosystem. Alternative architectures start from zero and cannot leverage any existing fine-tune.

The Capital Asymmetry and JEPA's Only Path

The Q1 2026 capital deployment is stark: over $112B within the Transformer ecosystem (OpenAI $110B, Apple-Google $1B, LTX-2 DiT deployment) versus $2B for paradigm replacement (AMI Labs + World Labs). The 56:1 ratio means the Transformer ecosystem has overwhelming resources to absorb any innovation that alternative architectures produce, and to extend competitive leads through scaling.

AMI Labs targets commercial applications in 1-2 years and universal systems in 3-5 years. In those 3-5 years, OpenAI will deploy $110B of capital extending Transformer dominance. NVIDIA will release two or three more hybrid model generations, absorbing any useful innovations from JEPA research. The architectural insurgent must not only be better but dramatically better.

JEPA's only structural path to independent success is if its training objective — predicting in abstract representation space — is architecturally incompatible with Transformer integration. You cannot add a 'JEPA layer' to a Transformer the way you add a Mamba layer. If JEPA requires a fundamentally different training paradigm, the absorption pattern may not apply. The historical analogy would then be CNNs eventually giving way to Transformers — a paradigm that required wholesale replacement, not incremental absorption. VL-JEPA's 1.6B parameter efficiency advantage at small scale is the first empirical signal to watch for whether JEPA's scaling laws validate this hypothesis.
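The incompatibility argument can be sketched in a few lines of numpy. This is a conceptual illustration of the two objectives, not I-JEPA or V-JEPA training code; the linear "encoder", the dimensions, and the untrained stand-ins are all invented:

```python
import numpy as np

rng = np.random.default_rng(0)
D_PIX, D_REP = 256, 32        # invented pixel / representation sizes

W_enc = rng.standard_normal((D_PIX, D_REP)) / np.sqrt(D_PIX)

def encode(x):
    """Stand-in for a deep encoder (real JEPA encoders are ViT-scale networks)."""
    return np.tanh(x @ W_enc)

context_patch = rng.standard_normal(D_PIX)   # visible region
target_patch  = rng.standard_normal(D_PIX)   # masked region to predict

# Generative objective (diffusion / autoregressive): error measured in pixel space,
# so the model must carry a decoder back to raw pixels or tokens.
decoded = rng.standard_normal(D_PIX)                      # untrained decoder stand-in
pixel_loss = np.mean((decoded - target_patch) ** 2)       # 256-dim target

# JEPA objective: the predictor matches the *representation* of the masked region;
# the target encoding is held constant (stop-gradient) and no pixel decoder exists.
predicted_repr = encode(context_patch)                    # predictor stand-in
target_repr    = encode(target_patch)                     # treated as a constant
jepa_loss = np.mean((predicted_repr - target_repr) ** 2)  # 32-dim target

print(target_patch.size, target_repr.size)  # 256 32
```

The structural point: the generative loss is anchored to pixel or token space, while the JEPA loss is defined only inside the learned 32-dimensional representation space. There is no single layer you could splice into a pixel- or token-predicting Transformer to obtain that objective, which is exactly the absorption-resistance hypothesis.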

Architecture Innovation-to-Absorption Cycle: DiT and Mamba Follow the Same Pattern

Alternative architecture innovations reach commercial deployment inside the Transformer ecosystem faster than they can build independent alternatives

2022-06 · JEPA paper published
LeCun proposes Joint Embedding Predictive Architecture as post-Transformer paradigm

2023-03 · DiT paper published (Xie)
Diffusion Transformer architecture designed for visual generation

2024-02 · Sora uses DiT
OpenAI deploys Xie's DiT architecture in proprietary Transformer ecosystem

2024-05 · Mamba-2 paper published
State space model proposed as O(n) alternative to O(n²) attention

2026-01 · LTX-2 ships DiT commercially
DiT reaches production-grade open-source deployment in Transformer ecosystem

2026-03 · Nemotron absorbs Mamba-2
NVIDIA hybrid model uses Mamba-2 as component layer within Transformer framework

2026-03 · AMI Labs raises $1.03B for JEPA
Commercial JEPA effort begins — 4 years after paper, still pre-product

Source: arXiv publication dates, product launch announcements

What This Means for Practitioners

For production architecture selection: Default to Transformer-based systems (GPT-5.4, Gemini, Claude) for the next 2-3 years. Adopt hybrid Mamba+Transformer architectures (Nemotron 3 Super) for agentic workloads requiring long context — the 91.75% RULER at 1M tokens advantage is real and accessible now.

For research-stage evaluation: Monitor VL-JEPA scaling results from Meta FAIR as the key empirical test of JEPA's architecture-level differentiation. If VL-JEPA performance holds at 7B parameters with proportional efficiency advantages, it becomes a legitimate 2027-2028 production consideration. If performance plateaus or scaling laws prove weaker than Transformer equivalents, the absorption trajectory is confirmed.

For investment and roadmap planning: JEPA and world model architectures carry genuine 3-5 year uncertainty. Do not build irreversible infrastructure dependencies on AMI Labs' timeline. The Transformer ecosystem's absorption engine is fast — DiT went from paper (2023) to production-grade open-source deployment (LTX-2.3, 2026) in 3 years. JEPA has been a paper since 2022 and is still pre-product. The clock is running on both sides simultaneously.
