Key Takeaways
- BDH achieves 97.4% on Sudoku Extreme while o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet all score approximately 0% — categorical transformer failure
- BDH uses only 5% sparse activation vs transformers' dense attention, inherently more compute-efficient per inference step
- BitNet ternary weights (-1, 0, 1) reduce VRAM by 77.8% and make 13B-parameter models runnable on an iPhone 16
- Sora required $130 per 10-second clip, a strong signal that transformer-based video generation is commercially unviable at current unit economics
- Combining sparse activation with extreme quantization could yield models both architecturally capable and edge-deployable
Three Signals Point to Architecture Breakthrough
BDH achieves 97.4% on Sudoku Extreme while o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet all score approximately 0% — a categorical failure mode of the transformer architecture. BitNet's ternary weights (-1, 0, 1) reduce VRAM by 77.8% and make 13B-parameter models runnable on an iPhone 16. Sora required $130 of transformer inference compute per 10-second video clip. These are not isolated phenomena; they point to a structural limit of the transformer architecture.
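As a back-of-envelope check on the memory claim, here is a weight-only VRAM sketch for a 13B-parameter model, assuming an fp16 baseline and ternary weights packed at 2 bits each. The packing density is an assumption, and real deployments add activation and KV-cache overhead, which is one reason a reported figure like 77.8% can differ from the raw weight ratio.

```python
# Weight-only memory footprint; ignores activations, KV cache, and runtime
# overhead. The 2-bit ternary packing is an assumption, not a BitNet spec.

def weight_memory_gb(n_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 13_000_000_000
fp16_gb = weight_memory_gb(n, 16)    # 26.0 GB
ternary_gb = weight_memory_gb(n, 2)  # 3.25 GB

print(f"fp16:    {fp16_gb:.2f} GB")
print(f"ternary: {ternary_gb:.2f} GB")
print(f"savings: {100 * (1 - ternary_gb / fp16_gb):.1f}%")
```

Under these assumptions the weights of a 13B model shrink from 26 GB to about 3.25 GB, comfortably within a modern smartphone's memory.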
[Chart: Sudoku Extreme accuracy, post-transformer vs transformer models. BDH's categorical superiority on constraint satisfaction exposes a structural transformer limitation. Source: Pathway internal benchmark, March 2026]
BDH's Categorical Superiority Exposes Transformer Limitation
Sudoku Extreme is a constraint-satisfaction problem with a discrete solution space. BDH solves it 97.4% of the time. Three of the most advanced LLMs available solve it approximately 0% of the time. This is not a benchmark point difference. It is categorical: BDH can do something that transformers fundamentally cannot.
Transformers use dense attention: every token attends to every other token. For constraint-satisfaction problems with clear logical constraints, this creates pathological behavior — the model spreads probability across invalid solutions because it cannot efficiently track hard constraints. BDH uses biologically-inspired sparse activation (only 5% of neurons fire), allowing efficient constraint tracking. Jon Krohn's independent analysis confirms BDH's categorical superiority and the structural nature of transformer limitations.
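To make the density contrast concrete, here is a toy sparse-activation step that keeps only the top 5% of units by magnitude and zeroes the rest. The 5% figure comes from the text above; the top-k rule itself is an illustrative stand-in, not BDH's actual activation mechanism.

```python
import random

def sparse_activate(pre_act, density=0.05):
    """Zero all but the top `density` fraction of units by magnitude."""
    k = max(1, int(density * len(pre_act)))
    # Magnitude of the k-th largest unit becomes the firing threshold.
    threshold = sorted((abs(v) for v in pre_act), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in pre_act]

random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
y = sparse_activate(x)
active = sum(1 for v in y if v != 0.0)
print(f"{active} of {len(x)} units active")  # 50 of 1000 units active
```

Downstream layers only need to process the 50 surviving units, which is where the per-step compute savings relative to dense activation come from.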
Sparse Activation and Extreme Quantization Are Complementary
BDH's 5% sparse activation density and BitNet's ternary quantization solve different efficiency problems: BitNet addresses precision (ternary weights eliminate multiplications outright, since each weight reduces to an add, a subtract, or a skip), while BDH addresses density (sparse activation means only a small fraction of neurons compute at each step). Combining them could yield models that are both architecturally capable (solving constraint problems) and edge-deployable (running on smartphones).
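A minimal sketch of ternary quantization in the spirit of BitNet b1.58's absmean recipe: scale by the mean absolute weight, then round and clip each weight to {-1, 0, +1}. This is a simplified illustration, not the reference implementation.

```python
# Simplified absmean ternary quantization: scale, round, clip.
# Illustrative only -- the published BitNet b1.58 recipe also handles
# per-tensor grouping and quantization-aware training.

def ternary_quantize(weights):
    gamma = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma

w = [0.8, -0.05, 0.3, -1.2, 0.02, 0.6]
q, gamma = ternary_quantize(w)
print(q)  # [1, 0, 1, -1, 0, 1]
```

Once the weights are ternary, a matrix-vector product needs no multiplier at all: each weight either adds the input, subtracts it, or skips it.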
Sora Proves Transformer Inference Costs Are Unviable
Sora's $130 per 10-second video clip sits roughly an order of magnitude above consumer willingness to pay. A consumer paying $10-20 per video would need OpenAI to cut inference cost by roughly 6.5-13x just to break even, before any margin. If alternative architectures can generate video at a fifth of the transformer cost, they will displace transformers regardless of benchmark metrics.
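The break-even arithmetic behind that gap, with the $130 cost and the $10-20 price band taken from the figures above:

```python
# Break-even cost reduction at each end of the consumer price band.
# $130/clip cost and $10-20 willingness to pay come from the article;
# the rest is arithmetic and ignores margin, hosting, and support costs.

cost_per_clip = 130.0
price_band = (10.0, 20.0)

for price in price_band:
    reduction = cost_per_clip / price
    print(f"price ${price:.0f}: cost must fall {reduction:.1f}x to break even")
```

At $20 per clip the required reduction is 6.5x; at $10 it is 13x, and any actual margin pushes the target further still.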
Adoption Timeline: Conservative to Optimistic
BitNet edge deployment: Now. The code is open-source, benchmarks are published, VRAM savings are real. BDH or similar post-transformer architectures: 12-18 months. But this requires scaling validation, independent reproduction, and general capability assessment. BDH is at GPT-2 scale (~125M parameters). Does it scale to GPT-3 scale? Unknown.
What This Means for Practitioners
ML engineers working on constraint-satisfaction, optimization, or scheduling problems should evaluate non-transformer architectures immediately. For edge deployment teams, BitNet quantization is production-ready now. For research teams, the case for architectural diversity is quantitative: architecture matters, and an architectural monoculture is a strategic risk.