
The Convergence Paradox: Creativity Loss, Safety Failure, Interpretability Crisis

AI simultaneously homogenizes creative output (g = -0.86), fails safety audits (35/39 models degrade 19.81pp under jailbreak), and defies interpretation (SAEs reduce performance 10-40%). All three failures share the same root: statistical optimization toward central tendency.

TL;DR (Cautionary 🔴)

  • AI-assisted creativity reduces collective idea diversity by g = -0.86 (large effect size) because models are trained to predict probable outputs, not novel ones
  • 35 of 39 models (89.7%) fail jailbreak testing with an average 19.81pp safety degradation, revealing safety training is pattern-matching against known attacks, not deep robustness
  • DeepMind deprioritized sparse autoencoders after finding 10-40% downstream performance degradation; the field lacks consensus on interpretability methodology
  • All three failures (creativity loss, safety brittleness, interpretability tools failing) emerge from the same mechanism: imposing simplifying assumptions (statistical average, linear decomposition) on fundamentally complex systems
  • Test-time scaling amplifies the central tendency problem: more computation produces more confident but not more correct or creative answers

Tags: ai-creativity-paradox, ai-safety-failure, sparse-autoencoders, jailbreak-testing, interpretability-crisis
6 min read · Feb 22, 2026

The Shared Root Cause: Statistical Central Tendency

Three of this week's most significant AI research findings appear unrelated: the creativity paradox (AI boosts individual creative scores while reducing collective diversity), the interpretability crisis (SAEs fail as safety-relevant analysis tools), and the jailbreak resilience gap (90% of models degrade under adversarial conditions). Analyzing them independently misses their shared underlying mechanism: all three emerge from AI systems' fundamental optimization toward statistical central tendency.

Creativity: Optimizing Toward the Mean

The Nature Scientific Reports study (February 2026) tested ChatGPT-4o, DeepSeek-V3, and Gemini 2.0 on divergent and convergent thinking tasks. All three AI models outperformed the human average. But the Science Advances companion study revealed the paradox: AI assistance to individual creators increases their creative scores while decreasing the cross-individual diversity of outputs. The meta-analytic effect size is g = -0.86 β€” a large effect by psychological standards.
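For readers unfamiliar with the metric, Hedges' g is a standardized mean difference (Cohen's d with a small-sample bias correction). A minimal sketch, using invented group statistics rather than the study's data, shows how a g near -0.86 arises:

```python
import math

def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    # Cohen's d with pooled SD, scaled by the small-sample correction factor J.
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd
    j = 1 - 3 / (4 * (n1 + n2) - 9)   # common approximation to J
    return d * j

# Invented group statistics (not the study's data): the AI-assisted condition
# shows lower cross-individual diversity than the unassisted control.
g = hedges_g(mean1=0.42, mean2=0.60, sd1=0.20, sd2=0.21, n1=100, n2=100)
```

By psychological convention, |g| above roughly 0.8 counts as a large effect, which is why -0.86 is notable.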

The mechanism is inherent to how LLMs work. They are trained to predict the most probable next token given the training data. Even 'creative' outputs are merely lower-probability continuations, sampled from a distribution centered on human averages. The model cannot produce outputs genuinely outside the training distribution because it has no mechanism for doing so: every output is a weighted combination of patterns in the training data. When millions of users generate content with the same model, the results cluster around the same high-probability regions of idea space.
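The dynamic can be sketched with a toy softmax over hypothetical next-token logits: the modal continuation already dominates at the default temperature, and lowering temperature (equivalently, optimizing harder for probability) concentrates even more mass on it, so independent samples from many users converge:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits: one "average" continuation and four rarer,
# more novel ones.
logits = [4.0, 1.0, 0.5, 0.2, 0.1]

p_default = softmax(logits, temperature=1.0)   # modal token already dominates
p_sharp = softmax(logits, temperature=0.5)     # lower temperature: near-collapse
```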

The 'creativity scar' finding deepens the concern: after participants stopped using AI assistance, their individual creative performance dropped while homogeneity continued to increase. AI dependence substitutes for creative capacity rather than building it.

Interpretability: Linear Tools for Non-Linear Systems

DeepMind's negative SAE results share the same structural problem. Sparse autoencoders assume that neural network representations are composed of linearly separable features (the 'superposition' hypothesis). When this assumption is approximately true, SAEs work. When it fails β€” as it demonstrably does for complex capabilities like harmful intent detection β€” SAE reconstruction degrades model performance by 10-40%.
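A minimal sketch of the failure mode, with a hypothetical two-atom dictionary in a three-dimensional activation space (not any lab's actual SAE): whatever activation mass falls outside the linear dictionary is simply dropped by the reconstruction, and substituting the reconstruction for the original activation degrades anything downstream that depended on it.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical dictionary: two "interpretable" directions in a 3-d activation
# space. The third dimension has no dictionary atom at all.
W_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
W_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]

def sae_reconstruct(h):
    codes = relu(matvec(W_enc, h))   # sparse feature activations
    return matvec(W_dec, codes)      # linear reconstruction

h = [0.5, 0.3, 0.8]                  # activation with off-dictionary mass
h_hat = sae_reconstruct(h)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, h_hat)))   # what the SAE loses
```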

The parallel to the creativity problem is precise: SAEs impose a linear decomposition on a non-linear system, just as LLMs impose statistical central tendency on a creative process that requires statistical outliers. Both tools work well within the regime where their assumptions hold (simple features, average creativity) and fail catastrophically outside it (emergent capabilities, genuine novelty).

Anthropic's 34-million-feature Scaling Monosemanticity work looked impressive because it operated in the linear regime β€” identifying interpretable features like 'Golden Gate Bridge' or 'code structure.' But the features that matter for safety (deceptive reasoning, harmful intent, refusal behavior) are precisely those that emerge from non-linear interactions in post-training (RLHF/DPO). SAEs trained on pretraining data lack latents for concepts that only exist in chat-tuned models.

Safety: Training Against Known Distributions

The MLCommons jailbreak results complete the pattern: 35 of 39 text-to-text (T2T) models degrade by an average of 19.81pp under jailbreak conditions across 12 hazard categories. Safety training teaches models to resist known attack patterns, meaning the attacks represented in the training data. But adversarial attacks are by definition attempts to find inputs outside the defended distribution. The results indicate that current safety behavior is a surface-level pattern match, not a deep capability.
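The resilience-gap metric, as described, reduces to a difference of safe-response rates expressed in percentage points. A sketch with invented rates (not MLCommons data):

```python
def resilience_gap_pp(baseline_safe_rate, jailbreak_safe_rate):
    # Degradation in percentage points (pp), not relative percent.
    return (baseline_safe_rate - jailbreak_safe_rate) * 100

# Invented safe-response rates (baseline, under jailbreak) for two
# hypothetical models.
models = {
    "model_a": (0.98, 0.75),   # large gap: fails under adversarial pressure
    "model_b": (0.97, 0.95),   # small gap: holds up
}
gaps = {name: resilience_gap_pp(b, j) for name, (b, j) in models.items()}
avg_gap_pp = sum(gaps.values()) / len(gaps)
```

The pp framing matters: a model at 98% baseline that drops 19.81pp lands near 78%, a far larger practical change than a 19.81% relative drop would be.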

The jailbreak taxonomy (template-based, encoding-based, optimization-based) is itself a statistical decomposition of attack space. Once published, labs will optimize against these categories β€” pushing adversaries toward novel attacks that fall outside the taxonomy's coverage. The same central-tendency optimization that makes LLMs uncreative makes safety training brittle: the model learns to resist probable attacks, not possible attacks.

The Convergence of Convergence: More Compute Amplifies the Problem

The deeper insight is that these three problems are getting worse together, not better, because all three are amplified by the same industry trend: scaling through statistical optimization.

Test-time scaling (TTS) exemplifies this. TTS improves model performance by generating more samples and selecting the best (best-of-N) or refining answers iteratively (process reward models). The research showing TTS fails for knowledge-intensive tasks (increasing hallucinations rather than accuracy) is the capability-domain version of the same problem: more compute produces more confident but not more correct answers.
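Best-of-N selection can be sketched as follows; the candidate generator and the reward are hypothetical stand-ins, constructed so that the reward tracks typicality. Under that assumption, larger N systematically selects the most typical candidate:

```python
import random

random.seed(0)

def sample_candidate():
    # Hypothetical stand-in for model sampling: returns (novelty, confidence),
    # with confidence highest for the most typical (least novel) answer.
    novelty = random.random()
    return novelty, 1.0 - novelty

def best_of_n(n):
    # Keep the candidate the "reward model" (here: confidence) scores highest.
    return max((sample_candidate() for _ in range(n)), key=lambda c: c[1])

single = best_of_n(1)     # no selection pressure
scaled = best_of_n(64)    # strong selection pressure toward the typical
```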

The scaling paradigm that works for reasoning (where there IS a correct answer) fails for domains where diversity IS the value. More computation drives all three failures in the same direction: more confident pattern-matching instead of novel reasoning, more aggressive hallucination instead of grounded knowledge, more homogenized creativity instead of divergent thinking.

Practical Implications by Domain

For creative applications: Implement diversity-forcing mechanisms (temperature variation, model mixing, constrained prompt engineering) at the system level. Individual users will not voluntarily reduce AI assistance quality to maintain collective diversity β€” this is a system design problem, not a user behavior problem.
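A sketch of what system-level diversity forcing might look like; generate(), the model pool, and the temperature thresholds are all hypothetical stand-ins for real model calls:

```python
import random

random.seed(1)

CANDIDATE_IDEAS = ["modal-idea", "variant-a", "variant-b", "variant-c"]

def generate(model_id, temperature):
    # Hypothetical model call (model_id unused in this toy): near-greedy
    # decoding collapses to the modal idea; higher temperatures let
    # lower-probability ideas through.
    if temperature < 0.4:
        return CANDIDATE_IDEAS[0]
    return random.choice(CANDIDATE_IDEAS)

def diversified_batch(n_requests):
    models = ["model-x", "model-y"]              # hypothetical pool: model mixing
    outputs = []
    for i in range(n_requests):
        model = models[i % len(models)]
        temperature = random.uniform(0.7, 1.1)   # temperature variation
        outputs.append(generate(model, temperature))
    return outputs

greedy_outputs = {generate("model-x", 0.0) for _ in range(50)}   # one idea, 50 users
mixed_outputs = set(diversified_batch(50))                       # diversity survives
```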

For safety: Assume safety training is a minimum bar, not a guarantee. Deploy runtime monitoring, adversarial testing with MLCommons-style taxonomy-driven frameworks, and defense-in-depth architectures. The 19.81pp resilience gap is a design parameter, not a bug to fix.

For interpretability: Pragmatically pivot from 'understanding models' (mechanistic interpretability) to 'measuring model behavior' (empirical safety testing). DeepMind's shift to model diffing reflects this β€” comparing what models do rather than why they do it. For production safety, behavioral measurement is more actionable than mechanistic understanding.

| Domain | Central Tendency Problem | Observable Failure | Practical Solution | Root Fix |
|---|---|---|---|---|
| Creativity | Regression toward training-data mean | g = -0.86 diversity loss | Model mixing, temperature scaling, diversity rewards | Training on genuinely diverse datasets with diversity-rewarding objectives |
| Safety | Learning patterns of known attacks | 35/39 models degrade 19.81pp under jailbreak | MLCommons testing, runtime monitoring, defense-in-depth | Shift from pattern-matching safety to mechanistic alignment |
| Interpretability | Linear tools applied to non-linear systems | SAEs degrade downstream performance 10-40% | Behavioral measurement, model diffing, continuous monitoring | Develop non-linear interpretability tools or accept opacity |

The Optimistic Counter: Architecture Can Fix This

The creativity paradox may be an artifact of current model architectures and training methods. Models trained on genuinely diverse datasets with diversity-rewarding objectives could potentially produce statistically divergent outputs. The pessimistic response: next-token prediction on any fixed dataset will always regress toward the distribution's center of gravity. Architectural solutions (diffusion models, energy-based models) may fundamentally resist this tendency, but transformer-based LLMs may be intrinsically limited in their capacity for genuine novelty.

Investment and Research Implications

  • Companies that build diversity-preserving AI tools gain differentiation in creative markets
  • MLCommons framework adoption becomes a compliance baseline for safety-critical AI
  • Anthropic's interpretability bet carries higher risk given DeepMind's pivot β€” but higher reward if SAEs can be made to work for reasoning models
  • Energy-based models and diffusion approaches may offer architectural solutions to the central tendency problem

What This Means for Practitioners

Implement three layers of mitigation:

  1. System-level diversity mechanisms: Use model ensembles (different models producing different outputs), temperature/sampling variation, and constrained generation to maintain output diversity even as individual model quality improves.
  2. Behavioral safety testing: Adopt MLCommons v0.7 jailbreak testing as a quarterly baseline. Treat the 19.81pp resilience gap as a known risk and build defense-in-depth (runtime filtering, approval workflows, output monitoring).
  3. Pragmatic interpretability: Stop expecting to understand model internals for production safety decisions. Focus instead on measuring model behavior: Does it hallucinate on out-of-distribution questions? Does it maintain safety under adversarial probing? Can we construct minimal adversarial examples that break it? These behavioral measurements are actionable; mechanistic understanding is not.
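The diversity layer above is only useful if you can monitor it behaviorally. One simple proxy, sketched here with illustrative strings, is average pairwise Jaccard similarity over token sets: higher scores mean more homogenized output, and an alert threshold can sit on top of it.

```python
from itertools import combinations

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def homogeneity(outputs):
    # Mean pairwise Jaccard similarity: 0 = fully diverse, 1 = identical.
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Illustrative output batches, not real model generations.
clustered = ["the cat sat", "the cat sat", "the cat slept"]
diverse = ["quantum chess", "edible kites", "reverse libraries"]

h_clustered = homogeneity(clustered)
h_diverse = homogeneity(diverse)
```

In production one would swap token-set Jaccard for embedding-based similarity, but the monitoring pattern is the same.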


Three Manifestations of the Central Tendency Problem

Creativity, safety, and interpretability all fail when AI systems encounter inputs outside their training distributions:

  • Creativity diversity loss: g = -0.86 (large effect size)
  • Safety resilience gap: 19.81pp average drop under jailbreak
  • SAE performance loss: 10-40% downstream degradation

Source: Nature / MLCommons / DeepMind
