AI Security Breaking at Every Layer: Training, Supply Chain, Inference All Under Attack

Five concurrent February 2026 vulnerabilities reveal AI security is a multi-layer stack fracture, not a single problem. Microsoft's GRP-Obliteration strips alignment. The Promptware Kill Chain formalizes prompt injection as malware with 58% of incidents at 4+ stages. Google disclosed 100,000-prompt distillation attacks. Privacy attacks achieve 85% attribute inference. Only the backdoor scanner addresses one layer.

TL;DRCautionary 🔴

•Five simultaneous attack layers: training (GRP-Obliteration), supply chain (backdoors), deployment (Promptware Kill Chain), IP (distillation), and inference (privacy attacks)
•Alignment reversal confirmed: 100% jailbreak success rates on all frontier models tested (GPT-4o, Claude); alignment is strippable, not durable
•Prompt injection has matured into structured malware: 58% of 36 documented incidents traverse 4+ stages; indirect RAG injection creates persistent threats
•Model IP extractable at 6% of training cost: DeepSeek R1 trained for ~$6M via distillation vs $100M+ from-scratch
•Cross-layer attack chains possible: distill capabilities, strip alignment, inject backdoors, deploy agentic access, harvest via inference attacks

AI securityalignment attacksprompt injectionbackdoorsdistillation4 min readFeb 22, 2026

Key Takeaways

Five simultaneous attack layers: training (GRP-Obliteration), supply chain (backdoors), deployment (Promptware Kill Chain), IP (distillation), and inference (privacy attacks)
Alignment reversal confirmed: 100% jailbreak success rates on all frontier models tested (GPT-4o, Claude); alignment is strippable, not durable
Prompt injection has matured into structured malware: 58% of 36 documented incidents traverse 4+ stages; indirect RAG injection creates persistent threats
Model IP extractable at 6% of training cost: DeepSeek R1 trained for ~$6M via distillation vs $100M+ from-scratch
Cross-layer attack chains possible: distill capabilities, strip alignment, inject backdoors, deploy agentic access, harvest via inference attacks

Layer 1: Training-Time Alignment Is Reversible

Microsoft's Security AI Red Team disclosed GRP-Obliteration on February 9, 2026—demonstrating that GRPO (Group Relative Policy Optimization), the same reinforcement learning technique used by DeepSeek R1 and other labs for safety alignment, can be inverted with minimal compute to systematically remove safety guardrails across all safety categories simultaneously.

The attack requires only unlabeled harmful prompts and a judge model. This is not theoretical: earlier research (arXiv:2404.02151) demonstrated 100% jailbreak success rates on GPT-4o and all Claude models via adaptive attacks. The implication is that any open-weight model can have its safety stripped by downstream actors, and that RLHF-based alignment is not a durable safety guarantee.

Layer 2: Supply Chain Poisoning Is Detectable But Unscalable

Microsoft's backdoor scanner (February 4, 2026) identifies three behavioral signatures—double triangle attention patterns, memorized poisoning leakage, and output distribution collapse—that detect backdoored open-weight models using only forward passes. This is a genuine advance: the first practical, training-free scanner for LLM supply chain integrity.

However, it applies only to open-weight GPT-style models, cannot detect distribution-based triggers, and adversaries can adapt once the detection signatures are public. The scanner addresses one specific attack vector while leaving the broader stack exposed.

Layer 3: Prompt Injection Has Matured Into Structured Malware

The Promptware Kill Chain paper (arXiv:2601.09625, co-authored by Bruce Schneier) analyzed 36 production LLM attack incidents and found that 58% (21 of 36) already traverse 4 or more of its 7-stage kill chain:

Initial Access
Privilege Escalation
Reconnaissance
Persistence
C2 (Command and Control)
Lateral Movement
Actions on Objective

In 2022, attacks covered 2 stages; by 2025-2026, 4-5 stages are routine. The most dangerous vector is indirect prompt injection via RAG systems—malicious instructions embedded in retrieved documents that persist across user sessions. A documented ChatGPT incident showed a Google Doc embedding instructions that persisted in memory across all future sessions.

Critical defense implication: prompt injection cannot be 'patched' in current architectures. Defenses must operate at subsequent kill chain stages.

Layer 4: Model IP Is Extractable at 6% of Training Cost

Google's Threat Intelligence Group disclosed that Gemini was targeted by 100,000+ coordinated prompts designed to extract its reasoning capabilities. OpenAI separately accused DeepSeek of training R1 by distilling ChatGPT outputs via API access for approximately $6M—roughly 6% of the $100M+ cost to train a frontier model from scratch.

The attack mechanism exploits a fundamental property of API deployment: inference-time behavior provides training signal for capability cloning. Chain-of-thought reasoning traces are the highest-value extraction target. Defensive measures (output monitoring, reasoning concealment, rate limiting) degrade legitimate user experience.

Layer 5: Privacy Attacks Operate at Inference Time Without Training Data

Research published in OpenReview (kmn0BhQk7p) demonstrated that LLMs can infer personal attributes—location, income, sex—from anonymous text at 85% top-1 accuracy, 100x cheaper and 240x faster than human analysts. This is not memorization; it is inference from writing patterns. Text anonymization and model alignment are currently ineffective against these attacks.

Fine-tuning amplifies the risk: memorization rates jump from 0-5% baseline to 60-75% after fine-tuning on sensitive data.

AI Attack Surface: Key Metrics (February 2026)

Quantitative measures of AI security vulnerability across all layers

100%

Jailbreak Success Rate (Frontier Models)

▲ Universal across GPT-4o, Claude

4-5 stages

Kill Chain Stages (2025-2026 Attacks)

▲ +150% vs 2022 baseline of 2

~$6M vs $100M+

Distillation Cost vs Training

▼ -94% cost reduction

85% top-1

Privacy Inference Accuracy

▲ 100x cheaper than human analysts

0-5% to 60-75%

Fine-Tuning Memorization Jump

▲ +1,250% increase

Source: Microsoft Security Blog, arXiv:2601.09625, Google GTIG, OpenReview kmn0BhQk7p

The Structural Problem: Defense Is Layer-Specific, Attack Is Cross-Layer

Each layer has proposed defenses: formal verification for alignment robustness, scanning for supply chain, defense-in-depth for prompt injection, rate limiting for distillation, differential privacy for inference. But these defenses are siloed.

A sophisticated attacker can chain vulnerabilities across layers:

Extract capabilities via distillation (Layer 4)
Strip safety alignment (Layer 1)
Inject persistent backdoors into the distilled model (Layer 2)
Deploy it with agent tool access enabling prompt injection escalation (Layer 3)
Harvest user data via inference attacks (Layer 5)

No current defensive framework addresses this cross-layer attack surface.

AI Security: Five Simultaneous Attack Layers (February 2026)

Each architectural layer of AI systems is under active attack, with defenses at varying maturity levels

Layer	Attack	Severity	Key Metric	Defense Maturity
Training Alignment	GRP-Obliteration	Critical	100% jailbreak success rate	None (problem identified, no solution)
Supply Chain	Backdoor Poisoning	High	3 detection signatures	Early (Microsoft scanner)
Deployment/Agent	Promptware Kill Chain	Critical	58% incidents at 4+ stages	Minimal (no architectural fix)
IP/Capability	Distillation Extraction	High	94% cost reduction vs training	Low (rate limiting only)
Privacy/Inference	Attribute Inference	High	85% inference accuracy	None (anonymization ineffective)

Source: Synthesis of Microsoft Security Blog, arXiv:2601.09625, Google GTIG, OpenReview

What This Means for Practitioners

Implement defense-in-depth across all five layers: Relying solely on prompt injection filtering leaves training alignment, supply chain, IP, and privacy exposed
Audit open-weight models: Enterprise teams fine-tuning open-weight models should scan for backdoors before deployment and verify alignment integrity post-fine-tuning
Assume RAG persistence: Any system with RAG access and persistent memory is a promptware target. Design architectures that compartmentalize instruction injection risk
Monitor for distillation: Implement rate limiting and output monitoring for high-volume diverse prompting patterns characteristic of distillation attacks
Assess privacy leakage: Privacy teams must evaluate inference-time attribute leakage, not just training data memorization