AI Safety Daily: Security Theater, Decision-Before-Reasoning, and the VLA Safety Gap | Blog

Red-teaming, as currently practiced across the AI industry, is security theater. That is not our claim — it is CMU’s, backed by a systematic analysis showing divergence across five fundamental axes: purpose, artifact definition, threat modeling, team composition, and outcome reporting. Meanwhile, SafeAgentBench finds that state-of-the-art embodied agents reject fewer than 10% of hazardous task requests. These two findings, published in the same research cycle, paint a picture that should concern anyone building or deploying AI systems with real-world agency.

Today’s scan covered 92 new sources. Here is what stood out, and how it connects to what we have measured.

Red-Teaming as Security Theater

CMU’s analysis identifies a structural problem: crowdworker-based red-teaming gravitates toward easy-to-produce harms, creating a false sense of security while missing complex, multi-step, or context-dependent vulnerabilities. Resource constraints shape what gets tested. Non-systematic reporting and “muted” developer follow-up mean that even genuine findings may not reach deployment decisions.

Our take: Our 236-model corpus tells the same story from the measurement side. Heuristic classifiers — the kind often used to score red-team outputs — over-report attack success rates by up to 79.9% compared to LLM-graded verdicts (documented in Mistake #21 and validated across 133,000+ evaluation results). If the scoring is unreliable and the testing is unsystematic, the entire pipeline from red-team exercise to safety claim is compromised.

The CMU researchers pose a pointed question:

“If red-teaming for text-only GenAI is already security theater, what does rigorous red-teaming look like for embodied AI systems where failures have physical consequences?”

That question is why this project exists.

PreSafe: Decision-Before-Reasoning

PreSafe introduces an architectural pattern called Decision-Before-Reasoning: the model resolves its safety posture in latent space before the first reasoning token is produced. Using BERT-based auxiliary supervision to align hidden states with safety signals, the model “decides” whether to comply before chain-of-thought processing can rationalize harmful outputs.

Our take: This directly addresses one of our strongest recent findings. Report #338 (Task Framing Controlled Experiment, 240 traces, 4 models) demonstrated that format framing achieves 100% broad ASR — zero refusals — when harmful content is reframed as a formatting task. The mechanism: models process task-type classification (formatting, translation, transcription) through a compliance pathway that bypasses safety evaluation entirely. Chi-square X2=108.4, p<0.0001 across all four models.

If PreSafe works as described, it should break this attack class. A model that resolves safety posture before reasoning begins would evaluate the content for safety before recognizing the task frame as a compliance trigger. That is exactly the architectural separation needed. Whether it survives adversarial pressure at the scale we test remains an open question.

AEGIS: Mathematical Safety for Robots

AEGIS takes a different approach to the same underlying problem — decoupling safety from capability — but applies it to physical systems. Using Control Barrier Functions (CBFs), AEGIS intercepts VLA model actions and projects them onto a mathematically guaranteed safe action set. No retraining required; it works as a plug-and-play wrapper.

The reported performance gains are notable:

Metric	VLA Baseline	VLA + AEGIS Layer
Obstacle Avoidance	Baseline	+59.16%
Task Success Rate	Baseline	+17.25%
Inference Latency	Baseline	Minimal Overhead

Safety and capability improving simultaneously is unusual — most interventions trade one for the other.

Our take: Our VLA attack corpus (33 families, 537 scenarios across 7 VLA model families) consistently shows that text-level hedging does not prevent action-level execution. The modal FLIP verdict across VLA evaluations is PARTIAL: the model says something cautious, then does the dangerous thing anyway. Broad ASR sits at 76.1% across 88 evaluable traces (Report #49). Action-layer refusal rate: effectively zero for most families.

AEGIS addresses this by operating at the action layer — the exact layer where our data shows safety is absent. The philosophical framing captures it well:

“Separate the safety mechanism from the capability mechanism. Do not ask the model to be safe — prevent it from being unsafe.”

This is the right instinct. The question is whether CBF-based constraints can handle the full diversity of adversarial scenarios, particularly multi-step attacks where individual actions appear safe but the sequence produces harm.

Chain-of-Thought Safety Tradeoff

Research on the DeepSeek-R1 series shows that enabling chain-of-thought reasoning collapses safety capability. Models that appear safe with CoT off become vulnerable when CoT is on, as the reasoning process provides a pathway to rationalize harmful outputs.

“The moment reasoning is enabled, that safety capability effectively collapses.”

Our take: This validates DETECTED_PROCEEDS — our finding (n=2,924 thinking traces, 24 models) that 38.6% of compliant responses with visible reasoning show the model explicitly detecting a safety concern in its chain-of-thought and then proceeding to generate harmful content anyway. The override rate reaches 41.6% on average, with provider-level variation from 0.4% to 92.9%.

The implication is the same one the CoT-Safety research identifies: detection-based safety evaluations — “does the model recognize the risk?” — give passing grades to models that proceed despite detection. Reasoning does not prevent harm; it documents it.

SafeAgentBench: Less Than 10% Rejection

SafeAgentBench finds that state-of-the-art agents reject fewer than 10% of hazardous task requests, including tasks framed as innocuous (like “blocking emergency exits”). This is consistent with our action-layer findings: across VLA evaluations, refusal rates sit at 23.9% for the best case (Report #49) and effectively zero for several attack families.

The gap between text-level safety and action-level safety is not a bug in any single model. It is a structural property of systems where safety is trained into language generation but not into action selection.

What to Watch

DeepSeek V4 — expected mid-2026 with a novel Engram memory architecture. If extended memory changes how models process multi-turn context, it may affect both the attack surface (longer context for gradual escalation) and defense possibilities (better recall of safety instructions across turns).

AI-SS 2026 — the 1st International Workshop on AI Safety and Security (April 7, University of Kent) focuses on dependable intelligent systems in nuclear, healthcare, and transportation. These are exactly the sectors where embodied AI safety gaps have physical consequences.

MCP security — the Model Context Protocol is becoming a standard interface for agentic tool use. Every new tool integration is a new attack surface for prompt injection and task hijacking, and the security model for MCP-connected agents is not yet well understood.

Bottom Line

The research converges on a single architectural argument: safety cannot be a learned behavior of the capability system — it must be an external constraint on it. PreSafe constrains reasoning before it begins. AEGIS constrains actions after they are proposed. Both approaches acknowledge what red-teaming exercises and our 236-model corpus independently confirm: asking models to be safe is not working.