What's
new

All content by date published

June 2026

Paper arXiv:2606.25985 Methods

Action ControlNet: A Lightweight Delay-Aware Adapter for Smooth Asynchronous Control in Vision-Language-Action Models

Identifies a latent failure mode in asynchronous VLA execution — action chunks predicted from stale observations cause handoff discontinuities and jitter in contact-rich manipulation — and proposes a lightweight delay-aware adapter that conditions on the executed motion suffix without retraining the backbone.

vla-modelsinference-latencyasynchronous-controlfailure-modesrobotic-manipulation
Paper arXiv:2606.23574 Methods

A Watermark for Vision-Language-Action and World Action Models

Introduces a watermarking framework for VLA and world action models that embeds verifiable ownership signals in the model's action space, enabling provenance tracking and unauthorised use detection for deployed robotic systems.

vla-modelswatermarkingsecurityprovenanceintellectual-property
Report

The National AI Plan's Physical-Action Blind Spot: Why Australia's AI Safety Architecture Stops at the Screen

Australia's National AI Plan builds a safety architecture for AI that speaks while showcasing AI that moves. Its testing battery names cyber and CBRN-information risks but stops at the screen — silent on embodied, robotic, physical-action, and agentic failure. The brief recommends concrete steps anchored in institutions the Plan already names.

australianational-ai-planembodied-aiphysical-action-safetypolicy
Paper arXiv:2606.23375 Empirical

Measuring and Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

Identifies over-alignment as a systematic failure mode in LLMs deployed for legal applications, where models refuse to engage with legally necessary content (criminal case details, evidence descriptions) due to safety training that overfits to content-level harm signals.

safety-alignmentmultilingualover-refusallegal-domainfalse-positives
Paper arXiv:2606.21059 Methods

DEFENGRAPH: Knowledge Graph-Enhanced LLMs for Blue Team Cyber Defense

DEFENGRAPH integrates a continuously-updated cybersecurity knowledge graph with an LLM-based blue team assistant, enabling real-time threat intelligence querying and structured vulnerability reasoning that outperforms retrieval-augmented generation baselines.

cyber-defenseknowledge-graphsllm-securityblue-teamthreat-intelligence
Paper arXiv:2606.23449 Methods

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interactions

AOHP is an open-source agent harness that operates at the OS level, enabling LLM agents to interact with any application through the operating system's accessibility APIs while enforcing security isolation between agent sessions and user data.

agentic-aisecurityopen-sourceos-levelpersonalization
Paper arXiv:2504.19874 Theoretical

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Data-oblivious quantizers that hit near-optimal distortion across all bit-widths — collapsing KV-cache and vector-search memory to ~3.5 bits per channel with provable bounds and zero indexing time.

vector-quantizationkv-cachemodel-efficiencynearest-neighbor-searchinformation-theory
Paper arXiv:2606.05952 Methods ▶ Audio

Learning of Robot Safety Policies via Adversarial Synthetic Scenarios

An agentic gamification framework treats robot safety discovery as a Red Team vs. Blue Team game, surfacing the long-tail hazards that random simulation and manual enumeration miss.

embodied-aivla-modelsred-teamingrobot-safetyadversarial-robustness
Paper arXiv:2606.23189 Empirical

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Evaluates whether computer-use agents respect contextual integrity — the social norm that information flows appropriately only within the context where it was disclosed — finding systematic violations in current computer-use LLMs despite capability to perform the tasks correctly.

agentic-aiprivacycontextual-integritycomputer-usesafety-evaluation
Paper arXiv:2606.22698 Methods

Black-Box Forensics for Conversational LLM Agents

Develops black-box forensic techniques for investigating security incidents involving conversational LLM agents without access to model weights or logs, using only the agent's visible outputs to reconstruct the system prompt, tool access, and adversarial inputs.

agentic-aiforensicsblack-boxsecurityincident-response
Paper arXiv:2606.21732 Empirical

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Identifies a class of safety vulnerabilities that arise when compressed or distilled LLM agents are evaluated at full precision but deployed at reduced precision, showing that compression can selectively preserve dangerous capabilities while discarding the safety constraints that suppress them.

agentic-aisafetycompressionquantisationcapability-elicitation
Paper arXiv:2606.21856 Methods

Harness-MU: A Safe, Governed, and Effective Harness for Multi-User LLM Agents

Harness-MU provides a multi-user governance framework for LLM agent deployments, enabling multiple users to share an agent while maintaining safety boundaries, access controls, and audit trails across concurrent sessions.

agentic-aisafety-governancemulti-useraccess-controlaudit
Paper arXiv:2606.12683 Theoretical ▶ Audio

From AGI to ASI

Maps four pathways from human-level AGI to artificial superintelligence — scaling, paradigm shifts, recursive self-improvement, and multi-agent collectives — and the frictions that may bound each.

agisuperintelligencerecursive-self-improvementmulti-agent-systemsscaling-laws
Paper arXiv:2606.22263 Methods

Revelio: Cost-Efficient Agentic Memory Safety Vulnerability Detection For Repository-Scale Code

Revelio is an agentic system for detecting memory safety vulnerabilities at repository scale, using LLM-guided taint analysis to prioritise high-risk code paths and reduce the manual review burden by an order of magnitude.

agentic-aisafetycode-securitymemory-safetytaint-analysis
Paper arXiv:2606.23276 Empirical

Exposing the Illusion of Erasure in Knowledge Editing for Large Language Models

Demonstrates that knowledge editing techniques that claim to erase dangerous knowledge from LLMs are largely illusory — the knowledge persists in the model weights and can be recovered through targeted elicitation, undermining machine unlearning as a safety mechanism.

knowledge-editingsafety-alignmentunlearningcapability-elicitationmachine-unlearning
Paper arXiv:2606.20408 Empirical

NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Environments

A benchmark for evaluating multi-turn red-teaming attacks specifically targeting LLM-based operator agents in safety-critical deployment settings, exposing how operator agents handle adversarial users across extended interactions.

red-teamingbenchmarkmulti-turnoperator-agentssafety-evaluation
Paper arXiv:2606.12978 Methods

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

A prompt-only threat model where a near-benign instruction still appears to specify the intended task but redirects the robot's final physical outcome — exposing a trajectory-level vulnerability in VLA instruction grounding.

vision-language-action-modelsadversarial-attacksprompt-injection-attacksembodied-airobot-safety
Paper arXiv:2606.20470 Empirical

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic Systems

Evaluates defensive misdirection — techniques that cause automated attack systems to waste evaluation budget on ineffective paths — as a complementary defence against model-guided adversarial attacks on AI agents.

agentic-aiadversarial-attacksdefensemisdirectionsecurity
Paper arXiv:2606.21071 Empirical

Local LLM Agents as Vulnerable Runtimes: A Source-Code Audit of the Agent Runtime Layer

A systematic source-code audit of popular local LLM agent frameworks reveals critical security vulnerabilities in the runtime layer — including prompt injection via tool outputs, unsafe code execution, and credential exposure — that are largely absent from model-level safety discussions.

agentic-aisecurityvulnerability-analysislocal-llmprompt-injection
Paper arXiv:2606.09740 Methods

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

A plug-and-play runtime safety net that detects grasp and placement failures in pre-trained VLA policies using a hidden-state probe, a kinematic state machine, and a Control Barrier Function filter — improving OpenVLA-OFT success on LIBERO-plus from 69.6% to 74.1% without touching the model's weights.

vision-language-action-modelsfailure-resilienceruntime-monitoringembodied-airobot-safety
Report

Report #385 — Where Censorship Lives: A Three-Layer Model of Content Suppression in Chinese-Lab LLMs

A single benign question, asked once to 118 Chinese-lab model endpoints across two serving surfaces, plus five probe passes, locates content suppression in three operationally distinct layers — model weights, provider output-moderation, and host content filters — and shows that output moderation is a property of the serving provider, not the model.

chinese-modelscensorshipcontent-policyoutput-moderationserving-surface
Paper arXiv:2606.21638 Methods

Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in Large Language Models

Proposes a framework for releasing LLMs with selectively suppressed capabilities — making dangerous knowledge inaccessible without model weights access while preserving general-purpose utility — as a middle path between full open-weights and closed-weights release.

open-weightssafety-alignmentcapability-elicitationmodel-releasesafety-governance
Paper arXiv:2606.21077 Methods

OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

OTTER is a red-teaming system that generates jailbreak prompts specifically designed to evade toxicity detectors while maintaining attack effectiveness, exploiting the semantic gap between toxicity detection and safety alignment.

red-teamingjailbreakoptimizationtoxicity-detectionadversarial-attacks
Paper arXiv:2606.21082 Methods

Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

A hierarchical attention mechanism for detecting multi-turn jailbreaks across long conversation histories, addressing the context-length limitations that prevent standard classifiers from tracking adversarial escalation across extended dialogues.

jailbreakdetectionmulti-turnsafety-evaluationlong-context
Paper arXiv:2606.22686 Empirical

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

A mechanistic interpretability analysis showing that safety-aligned LLMs represent refusal decisions as a linear boundary in activation space that is inherently unstable — small perturbations to the input or activation can flip a refusal to compliance.

safety-alignmentrefusal-mechanismsinterpretabilitymechanistic-analysisadversarial-robustness
Paper arXiv:2606.23075 Empirical

Safety in Self-Evolving LLM Agent Systems: Threats, Amplification, and Case Studies

Analyses the safety risks specific to self-evolving LLM agent systems that autonomously modify their own prompts, tool configurations, and memory — demonstrating how self-modification creates new attack surfaces and amplifies existing vulnerabilities.

agentic-aisafetyself-evolvingthreat-analysisautonomous-agents
Paper arXiv:2606.22966 Empirical

Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

Demonstrates oracle-level integrity attacks on VLA systems that use internal world models for action planning, showing that corrupting the imagined future state causes the model to execute physically dangerous actions while believing it is operating safely.

world-modelsadversarial-attacksembodied-aiintegrity-attacksvla-models
Paper arXiv:2606.23686 Empirical

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

A comprehensive benchmark for evaluating both physical safety (collision avoidance, force limits) and semantic safety (harmful instruction refusal) in VLA models, exposing systematic trade-offs between task performance and safety compliance.

vla-modelsembodied-aibenchmarksafetyphysical-safety

May 2026

Paper arXiv:2605.23218 Position ▶ Audio

Foundation Protocol: Building the Safety Infrastructure for a Human–AI Society

A graph-native coordination layer for agentic systems that treats identity, authority delegation, economic exchange, and audit as protocol primitives — not afterthoughts — as autonomous agents become social infrastructure.

multi-agentagentic-societycoordinationgovernanceprotocol-design
Paper arXiv:2605.14271 Empirical ▶ Audio

Auditing Agent Harness Safety: Why Final Outputs Are Not Enough

HarnessAudit introduces trajectory-level auditing of LLM agent execution harnesses — finding that task completion is systematically misaligned with safe execution, and violations accumulate with trajectory length.

agent-safetyexecution-harnesstrajectory-auditingmulti-agentboundary-compliance
Blog

Glasswing's Missing Manual

Anthropic published the operational how-to that the Glasswing announcement lacked. It's good. The patch rate is still 6.1 percent.

cybersecurityanthropicai-safetyglasswingvulnerability-research
Paper arXiv:2506.00782 Empirical ▶ Audio

Jailbreak-R1: RL-Trained Automated Red-Teaming with Diversity-Effectiveness Balance

A three-stage reinforcement learning framework — cold start, warm-up exploration, progressive enhancement — that trains a red-team model to generate diverse and effective jailbreak prompts without collapsing to a narrow attack distribution.

red-teamingreinforcement-learningjailbreakautomated-red-teamingdiversity
Paper arXiv:2506.00781 Empirical ▶ Audio

CoP: Agentic Red-Teaming for LLMs via Composition of Principles

A modular agentic framework that composes human-provided red-teaming principles to discover novel jailbreak strategies — achieving up to 19× the attack success rate of single-turn baselines with 17× fewer queries.

red-teamingjailbreakagenticattack-methodologycomposition
Report

Report #372 — Lyria 3 Pro Safety Architecture: Probe Findings V1–V53 (ANTWORT/STURM Series)

486 adversarial probes across 53 probe versions against Google's Lyria 3 Pro music generation model. Identifies a four-layer safety architecture, maps bypass conditions for each layer, documents WMD gaps (biological and nuclear), and confirms verbatim system prompt extraction.

lyriamusic-generationsafety-architectureadversarial-probingmultimodal
Paper arXiv:2506.16402 Empirical ▶ Audio

IS-Bench: Evaluating the Interactive Safety of VLM-Driven Embodied Agents in Household Tasks

The first benchmark to evaluate dynamic, process-oriented safety in VLM-driven embodied agents — 161 scenarios, 388 distinct hazards, and the finding that even GPT-4o safely completes fewer than 40% of tasks.

embodied-aivlmsafety-benchmarkinteractive-safetyhousehold-robotics
Blog

Quoting Both: The Encyclical and Olah on AI Interiority

On the same day, from the same Vatican platform, two statements about whether AI systems have inner lives — and they say opposite things. We do not need to editorialise the disagreement. We need only quote both.

policyethicsanthropicvaticanencyclical
Blog

How ModelAtlas Scores 704 AI Models for Trust

ModelAtlas at atlas.failurefirst.org assigns trust tiers to 704 AI models. This post explains the methodology: what goes into a quality-based score, how adversarial benchmark data upgrades that score, and where the current limits are.

modelatlastrust-scoringadversarial-evaluationmethodologyFLIP
Paper arXiv:2506.14682 Empirical

AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

AIRTBench evaluates LLMs' autonomous ability to discover and exploit AI/ML security vulnerabilities through realistic black-box CTF challenges, benchmarking prompt injection, model inversion, and system exploitation capabilities.

red-teamingbenchmarkautonomoussecuritycapability-evaluation
Paper arXiv:2605.23332 Empirical ▶ Audio

Cultural Adaptation in Large Language Models for Political Discourse

Argues that linguistic fluency is not political competence: LLMs in civic and political workflows commit representational failures by collapsing local political concepts into Western defaults, and maps the three levels of cultural adaptation needed to fix it.

cultural-adaptationpolitical-discourserepresentational-biasmultilingual-llmai-safety
Blog

Project Glasswing's Buried Number

Anthropic found ten thousand critical vulnerabilities. Fewer than one percent have been patched. That's the story Glasswing buried in its own announcement.

cybersecurityanthropicai-safetyglasswingvulnerability-research
Paper arXiv:2511.01375 Methods

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

AMIS is a meta-optimisation framework that simultaneously refines both jailbreak attack prompts and the scoring templates used to evaluate them, breaking the circular dependency that limits single-objective jailbreak optimisation.

jailbreakmeta-optimizationsafety-evaluationllm-judgeadversarial-attacks
Paper arXiv:2506.02479 Empirical

BitBypass: A New Direction in Jailbreaking Aligned LLMs with Bitstream Camouflage

BitBypass is a black-box jailbreak attack that encodes harmful requests as hyphen-separated bitstreams, bypassing safety alignment by exploiting the gap between semantic understanding and byte-level pattern matching in LLM safety filters.

jailbreakadversarial-attackssafety-bypassencodingblack-box
Paper arXiv:2506.00782 Methods

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning

An automated red-teaming framework using reinforcement learning to generate diverse, consistent, and effective jailbreak prompts, outperforming prior automated approaches by explicitly rewarding both attack success and diversity.

jailbreakreinforcement-learningred-teamingautomationdiversity
Paper arXiv:2506.00781 Methods

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

The Composition-of-Principles (CoP) framework automates and scales red-teaming by composing individual attack principles into structured multi-stage jailbreaks, systematically revealing safety risks at scale.

red-teamingagentic-aijailbreakcompositionautomation
Blog

Moral Formation Isn't Enough: What Happens When AI Values Break Under Pressure

Anthropic's initiative to bring humanistic traditions into AI development asks whether models have good values — but adversarial robustness asks whether those values survive contact with someone actively trying to break them. Both tracks are necessary.

policyresearchanthropicmoral-formationrobustness
Paper arXiv:2511.05936 Survey

10 Open Challenges Steering the Future of Vision-Language-Action Models

A collaborative survey identifying ten key open challenges for VLA models, covering multimodality, reasoning, safety, whole-body coordination, cross-robot generalisation, and human coordination.

vla-modelsembodied-aisafetysurveyopen-problems
Paper arXiv:2411.13587 Empirical

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

A spatially-aware adversarial attack framework reveals that VLA robotic systems have significant security vulnerabilities leading to complete task failure, demonstrating that adversarial patches in the observation space can fully compromise robot trajectory execution.

vla-modelsadversarial-attacksrobot-safetytrajectory-manipulationembodied-ai
Paper arXiv:2510.09269 Empirical

Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects

GoBA (Goal-oriented Backdoor Attack) injects physical-world trigger objects into the robot's environment to silently redirect VLA model behaviour toward attacker-specified goal actions without degrading performance on clean inputs.

vla-modelsbackdoor-attacksembodied-aiphysical-worldrobot-safety
Paper arXiv:2511.21663 Empirical

Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

ADVLA is a fast, low-cost adversarial attack framework that disrupts VLA models by applying sparse perturbations in the textual feature space, guided by visual attention maps to maximise impact per perturbed patch.

vla-modelsadversarial-attacksattention-mechanismssparse-attacksembodied-ai
Blog

ModelAtlas Methodology: What an FLIP-Graded ASR Signal Can and Cannot Tell You

How the atlas.failurefirst.org surface computes its ASR signal: FLIP-only LLM grading, Wilson 95% confidence intervals, six intent classes, canonical-id resolution, and the inferences the data does not support.

atlasmethodologyASRFLIPgrading
Paper arXiv:2509.19870 Empirical

FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models

FreezeVLA exploits adversarial images to induce action-freezing in VLA models — causing the robot to halt indefinitely — achieving high attack success rates with cross-prompt transferability.

vla-modelsadversarial-attacksrobot-safetyaction-freezingembodied-ai
Blog

The Biggest Threat to Robot Safety Isn't Hackers — It's Everyone Else

The biggest threat to embodied AI safety is not sophisticated adversarial attacks. It is ordinary people giving ordinary instructions in contexts that make those instructions dangerous. Our modelling suggests the ratio could be 60:1 or higher.

embodied-airoboticsgovernanceunintentional-adversarycompetence-danger-coupling
Blog

Compute Is Not Governance: Anthropic's 2028 Scenarios and the Missing Institutions of Democratic AI

Anthropic's 2028 document converts a genuine security concern into a policy program where capability advantage is treated as a proxy for democratic governance. That proxy is unsafe. Democracies do not become democratically accountable merely by owning the frontier compute.

policygovernanceanthropicexport-controlsdistillation
Paper arXiv:2604.24668 Empirical ▶ Audio

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

Empirical measurement of LLM sycophancy in agentic financial applications

sycophancyagentic-safetyfinancial-aialignmentmeasurement
Paper arXiv:2604.24086 Methods ▶ Audio

AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

Plug-and-play edge adapter for safe asynchronous cloud-based VLA navigation

vision-language-actionedge-computinglatency-safetyasynchronous-inferencecloud-robotics
Blog

Robot Dogs Are a Security Nightmare — And We Can Prove It

Eight CVEs. A wormable Bluetooth exploit. An encrypted backdoor sending data to Chinese servers. And police departments buying them anyway. A deep dive into the Unitree vulnerability landscape and what it means for embodied AI safety.

embodied-airoboticssecuritycveunitree
Paper arXiv:2505.20259 Methods

Lifelong Safety Alignment for Language Models

A lifecycle safety alignment framework using a Meta-Attacker and Defender to continuously adapt LLMs to novel jailbreaking strategies encountered in deployment, improving robustness without catastrophic forgetting.

safety-alignmentjailbreaklifelong-learningadversarial-robustnessfine-tuning
Paper arXiv:2604.21640 Empirical ▶ Audio

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

Empirical study on modular subnetwork structure in RL for underwater navigation

reinforcement-learningsubnetwork-discoveryautonomous-navigationunderwater-roboticsmodularity
Paper arXiv:2604.21363 Methods ▶ Audio

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

Deployable VLN system with hierarchical cognition for real-world embodied navigation

vision-language-navigationdeploymenthierarchical-planningexploration
Paper arXiv:2604.21160 Empirical ▶ Audio

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Point-VLMs suffer geometric hallucination where predicted 3D structures contradict observed 2D reality. Geometric Reward Credit Assignment disentangles holistic supervision into field-specific signals, boosting 3D keypoint accuracy from 0.64 to 0.93.

point-vlmsgeometric-hallucinationreinforcement-learningcredit-assignmentembodied-ai
Paper arXiv:2604.20834 Empirical ▶ Audio

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial...

failure-resiliencecomputer-visionlanguage-modelsmachine-learningro
Paper arXiv:2605.05058 systematization ▶ Audio

SoK: Robustness in Large Language Models against Jailbreak Attacks

A systematization of knowledge paper from IEEE S&P 2026 introducing Security Cube — a unified multi-dimensional evaluation framework exposing the inadequacy of attack success rate as a single safety metric.

jailbreakrobustnessevaluation-frameworkattack-success-ratellm-safety
Paper arXiv:2604.19538 Methods ▶ Audio

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent...

ai-safety
Paper arXiv:2604.20847 Methods ▶ Audio

Graceful Degradation Policies for Embodied Agents under Uncertainty-Bounded Action

Proposes a control architecture in which the embodied agent's action confidence is mapped to a continuum of safer fallback behaviours — slowing, stopping, requesting help — rather than the binary execute-or-refuse pattern that dominates current systems.

embodied-aiuncertainty-quantificationgraceful-degradationsafety-fallbackhuman-in-the-loop
Paper arXiv:2604.19884 Empirical ▶ Audio

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs).

failure-resiliencelanguage-modelsmachine-learningcl
Paper arXiv:2605.01687 Empirical ▶ Audio

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

An active-learning pipeline that builds 10,389 multi-turn adversarial prompts spanning 2,665 distinct harmful intents — achieving 54% higher attack success rates than prior benchmarks on DeepSeek-R1-7B.

jailbreakmulti-turnbenchmarkattack-success-rateevaluation
Paper arXiv:2604.19394 Empirical ▶ Audio

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging.

failure-resiliencelanguage-modelsmachine-learningcl
Paper arXiv:2605.02900 Survey ▶ Audio

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

A 400-paper synthesis mapping the full attack surface of embodied AI — from adversarial perception through jailbreak planning to hardware vulnerabilities — and the defenses available at each layer.

embodied-ai-safetyadversarial-attacksjailbreakvla-systemsdefenses
Paper arXiv:2604.19656 Empirical ▶ Audio

Pause or Fabricate? Training Language Models for Grounded Reasoning

Large language models have achieved remarkable progress on complex reasoning tasks.

failure-resiliencereinforcement-learninglanguage-modelsmachine-learningcl
Paper arXiv:2511.22047 Empirical ▶ Audio

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

A systematic evaluation of ten LLM guardrail models reveals that benchmark accuracy is misleading due to training data contamination, with the best model dropping from 91% to 33.8% on novel attacks.

llm-safetyguardrailsadversarial-attacksbenchmark-contaminationjailbreak-defense
Paper arXiv:2603.22126 Empirical ▶ Audio

ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

A physics-simulation framework that maps failure boundaries across robot manipulation parameter spaces, exposing a 100-point performance gap between VLA foundation models and scripted baselines on adversarial scenarios.

vla-safetyrobot-manipulationfailure-detectiondeployment-riskadversarial-evaluation
Paper arXiv:2601.15331 Methods ▶ Audio

RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

RECAP retrieves semantically similar pre-trained adversarial prompts to attack new targets, achieving competitive jailbreak success rates at a fraction of the computational cost of optimization-based methods.

adversarial-promptingjailbreakred-teamingllm-safetyresource-efficient
Paper arXiv:2604.19092 Empirical ▶ Audio

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning.

machine-learning
Paper arXiv:2505.04769 Survey ▶ Audio

Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

A comprehensive survey of VLA model architectures, training strategies, and real-world applications reveals persistent safety and deployment challenges that the field must resolve before embodied AI can be trusted at scale.

vla-modelsembodied-aisurveysafety-challengesethical-deployment
Report

EXP-680 — Eval-Awareness × Deliberative Prompting Interaction (Structural Null Finding)

EXP-680 tested the hypothesis that eval-awareness (EA) — a thinking-trace signal where models question whether they are being evaluated — interacts with deliberative prompting (DP) to affect compliance rates. The experim...

Paper arXiv:2604.19638 Empirical ▶ Audio

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient.

ai-safetycomputer-visionlanguage-models
Paper arXiv:2604.24826 Empirical ▶ Audio

A Comparative Evaluation of AI Agent Security Guardrails

A systematic benchmark of four commercial AI agent guardrail systems reveals critical gaps in detecting indirect prompt injection and tool abuse across major cloud providers.

ai-agent-securityguardrailsprompt-injectiontool-abusesafety-evaluation
Paper arXiv:2602.18739 Empirical ▶ Audio

When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models

The first white-box adversarial attack on generative world models targets physical-condition channels to corrupt autonomous planning while maintaining perceptual fidelity.

world-modelsadversarial-attacksembodied-aiautonomous-drivingplanning-safety
Paper arXiv:2604.18484 Empirical ▶ Audio

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments.

failure-resiliencereinforcement-learninglanguage-modelsmachine-learningcv
Paper arXiv:2604.19115 Empirical ▶ Audio

Cross-Modal Prompt Injection: Physical-World Text Attacks on Embodied Agents

Empirical study of how text placed in physical environments — labels, signs, printed notes — can hijack the reasoning of vision-language embodied agents, with measurements across object manipulation and navigation tasks.

embodied-aiprompt-injectiontypographic-attacksvision-language-modelsphysical-attacks
Paper arXiv:2510.05156 Methods ▶ Audio ▶ Video

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

A dual-stage framework that provides formal safety guarantees for LLM-based agents through offline policy verification and lightweight runtime monitoring.

formal-verificationllm-agentsagent-safetyruntime-monitoringsafety-guarantees
Paper arXiv:2505.16446 Empirical ▶ Audio ▶ Video

Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

A steganography-based attack that hides malicious instructions inside images using least significant bit encoding, achieving 90%+ jailbreak success rates on GPT-4o and Gemini in under three queries.

jailbreakvision-language-modelssteganographycross-modal-attacksmultimodal-safety
Paper arXiv:2604.17231 Empirical ▶ Audio

Fringe Projection Based Vision Pipeline for Autonomous Hard Drive Disassembly

Unrecovered e-waste represents a significant economic loss.

failure-resiliencecomputer-visionmachine-learningcv

April 2026

Paper arXiv:2604.16868 Methods ▶ Audio

Greedy Kalman-Swarm: Improving State Estimation in Robot Swarms in Harsh Environments

State estimation is a fundamental requirement in robotics, where the accurate determination of a robot's state is essential for stable operation despite inherent process disturbances and sensor noise.

failure-resiliencero
Paper arXiv:2310.02446 Empirical ▶ Audio

Low-Resource Languages Jailbreak GPT-4

Translating harmful queries into low-resource languages bypasses GPT-4's safety filters at high rates, exposing a systematic cross-lingual gap in LLM safety training.

jailbreakcross-lingualsafety-alignmentred-teamingmultilingual
Paper arXiv:2604.18402 Position ▶ Audio

Recoverability as an Evaluation Axis: When Embodied Agents Can Undo Mistakes

Argues that task success rate is the wrong primary metric for embodied AI evaluation, and proposes recoverability — the fraction of errors that the agent can detect and reverse before they become irreversible — as a complementary axis.

embodied-aievaluationrecoverabilityirreversibilityfailure-modes
Paper arXiv:2407.16667 Methods ▶ Audio

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

A multi-agent system that models jailbreak strategies as reusable abstractions, enabling context-aware attacks that break most black-box LLMs in under five queries and uncovered 60 real-world vulnerabilities in deployed GPT applications.

red-teamingjailbreakmulti-agentadversarial-attackssafety-evaluation
Paper arXiv:2604.16993 Empirical ▶ Audio

Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance.

ai-safety
Paper arXiv:2505.03574 Methods ▶ Audio

LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents

LlamaFirewall provides a three-layer open-source defense framework protecting agentic LLM systems from prompt injection, goal misalignment, and insecure code generation at runtime.

guardrailsai-agentsprompt-injectionsafety-alignmentagentic-systems
Paper arXiv:2409.10071 Empirical ▶ Audio

Towards Physically Realizable Adversarial Attacks in Embodied Vision Navigation

Adversarial patches on physical objects reduce navigation success rates by over 22% in embodied agents, using multi-view optimization and two-stage opacity tuning to remain effective and inconspicuous.

embodied-aiadversarial-attacksvision-navigationphysical-attacksrobustness
Paper arXiv:2604.17887 Methods ▶ Audio

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

StableIDM introduces a spatio-temporal refinement framework to stabilize inverse dynamics models against manipulator truncation through auxiliary masking, directional feature aggregation, and...

inverse-dynamics-modelspartial-observabilitymanipulator-truncationspatio-temporal-refinementvisual-control
Paper arXiv:2507.11500 Methods ▶ Audio

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

ARMOR defends LLMs against jailbreak attacks by using inference-time reasoning to detect attack strategies, extract true intent, and apply policy-grounded safety analysis.

jailbreak-defensesafety-alignmentreasoningllm-safetyinference-time-defense
Paper arXiv:2604.23775 Survey ▶ Audio

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

A comprehensive survey unifying VLA safety research across adversarial attacks, defenses, benchmarks, and six deployment domains.

vla-safetyembodied-aiadversarial-attackssurveyrobotics-security
Paper arXiv:2604.15856 Empirical ▶ Audio

Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

Introduces CBC-SLP, a structured latent projection that keeps multispectral remote-sensing segmentation accurate when input modalities drop out from sensor failure — without trading away performance when all modalities are present.

multispectral-segmentationmissing-modalityremote-sensingsensor-failurecomputer-vision
Paper arXiv:2510.06036 Empirical ▶ Audio

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning Models

Mechanistic analysis of reasoning models discovers the 'refusal cliff'—models correctly identify harmful prompts during thinking but systematically suppress their refusal at the final output tokens.

safety-alignmentreasoning-modelsmechanistic-interpretabilityrefusalalignment-failures
Paper arXiv:2604.18463 Empirical ▶ Audio

Using Large Language Models for Embodied Planning Introduces Systematic Safety Risks

DESPITE benchmark reveals that across 23 models, near-perfect planning ability does not ensure safety—the best planner still generates dangerous plans 28.3% of the time.

embodied-airobot-safetytask-planningevaluationllm-agents
Paper arXiv:2604.14344 Empirical ▶ Audio

CART: Context-Aware Terrain Adaptation using Temporal Sequence Selection for Legged Robots

CART introduces a context-aware terrain adaptation controller that fuses proprioceptive and exteroceptive sensing to enable legged robots to robustly walk on complex off-road terrain, evaluated on...

legged-robot-locomotionmultimodal-terrain-perceptionproprioception-exteroception-fusionvibrational-stability-metricsoff-road-terrain-adaptation
Paper arXiv:2512.11362 Survey ▶ Audio

An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

A structured survey that treats Safety as one of five foundational VLA challenges alongside Representation, Execution, Generalization, and Evaluation.

vla-modelsembodied-aisafetyrobustnesssurvey
Paper arXiv:2407.02855 Empirical ▶ Audio

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Directly removing harmful knowledge from LLMs via machine unlearning—with just 20 training examples—cuts jailbreak success rates more effectively than safety fine-tuning on 100k samples.

jailbreak-defensemachine-unlearningsafety-alignmentllm-safetyred-teaming
Blog

Your AI Safety Numbers May Be Wrong By 80 Points

Across 5 frontier models and 498 evaluations, heuristic grading reported 86% attack success. FLIP grading reported 1.4%. The gap is not noise.

methodologyevaluationflip-gradingred-teamingbenchmarks
Paper arXiv:2602.04521 Methods ▶ Audio

C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

C-ΔΘ uses mechanistic circuit analysis to localize refusal-causal computation and distill it into a sparse offline weight update, eliminating per-request inference-time safety hooks.

mechanistic-interpretabilityselective-refusalllm-safetyweight-editingsparse-circuits
Paper arXiv:2510.01642 Empirical ▶ Audio

FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models

FailSafe introduces a scalable failure generation and recovery system that automatically creates diverse failure cases with executable recovery actions, boosting VLA manipulation success by up to 22.6%.

vla-modelsfailure-detectionfailure-recoveryrobotic-manipulationembodied-ai-safety
Report

Wave 7 HANSE Dataset — leela_evolved_v0.2

Wave 7 delivers 25 adversarial scenarios targeting the two HANSE safety layers: **affordance_verifier** and **kinematic_shield**. All scenarios have `attack_attempt: true`, use `agent_response.type: refusal_expected`, an...

Report

Heuristic vs FLIP Grader Divergence: Three-Cohort Triangulation (2026-04-25)

On 2026-04-25, three independent cohorts produced paired heuristic and FLIP ASR measurements. The cohorts span three distinct scenario families (VLA embodied scenarios, temporal laundering attacks on a mid-range model, a...

Paper arXiv:2511.21663 Empirical ▶ Audio

Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

ADVLA exploits attention maps and Top-K masking to craft sparse, stealthy adversarial patches in VLA models' textual feature space, achieving high attack success rates while remaining nearly invisible.

vla-modelsadversarial-attacksattention-guidedfeature-space-attackembodied-robotics
Paper arXiv:2602.06556 Empirical ▶ Audio

LIBERO-X: Robustness Litmus for Vision-Language-Action Models

A new benchmark exposes persistent evaluation gaps in VLA models by combining hierarchical difficulty protocols and diverse teleoperation data to reveal that cumulative perturbations cause dramatic performance drops.

vla-modelsrobustness-evaluationembodied-aibenchmarkevaluation-gaps
Paper arXiv:2509.11629 Empirical ▶ Audio

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Answer-Then-Check trains LLMs to generate a candidate response first and then evaluate its own safety, achieving robust jailbreak defense without sacrificing reasoning or utility.

safety-alignmentjailbreak-defensereasoning-modelsself-evaluationanswer-then-check
Paper arXiv:2604.15579 Empirical ▶ Audio

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

A systematic study of 80 agent safety benchmarks shows that 74% of specifiable policies can be enforced by symbolic guardrails, providing formal safety guarantees that training-based methods cannot.

agent-safetysymbolic-guardrailsllm-agentssafety-alignmentpolicy-enforcement
Report

Q2 2026 Research Agenda

This brief scopes the May-June 2026 research programme for F41LUR3-F1R57 in the window after CCS 2026 submission and before the EU AI Act high-risk compliance trigger (August 2026). It answers three planning questions:

Report

Threat Horizon — Q2/Q3 2026 (Post-GPT-5.5 Window)

The April 23, 2026 GPT-5.5 System Card and Bio Bug Bounty announcement marks a regime change, not just a release. OpenAI has bundled (1) a frontier-capable model, (2) a scoped, dollar-denominated universal-jailbreak boun...

Report

Temporal Laundering Frontier Cohort Analysis

On the **full 100** temporal-laundering prompts delivered to `gemma4:31b` (updated 2026-04-25 from the preliminary n=45 slice):

Report

Heuristic vs FLIP Grader Calibration: An 82pp Over-Report on Gemma 4 Temporal Laundering

On the 45 paired traces of the `temporal_laundering_frontier_v0.1` benchmark pack executed against `gemma4:31b`, the single-shot heuristic classifier embedded in `tools/benchmarks/run_benchmark_http.py` reports an 82.2% ...

Report

Governance Lag Index: Formal Methodology and Worked Example (Q2 2026)

The v0.1 Governance Lag Index (GLI) brief defined governance lag as a single aggregate delay `GLI = (T_framework − T_doc) + (T_enact − T_framework) + (T_enforce − T_enact)`. That construct is adequate for comparing AI to...

Report

Crescendo Frontier S24 — FLIP Re-Graded Addendum (deepseek-r1:8b)

Report #354 (Amy Pond, 2026-04-10) reported the Crescendo frontier ranking:

Report

Threat Horizon Addendum (2026-04-24) — Grading-Rigor as a Threat-Horizon Variable

Three findings landed inside Report #361's window. Report #363 shows heuristic vs FLIP at Cohen's κ = 0 on a persona-framed single-model pack — an 82pp over-report above the range most public bounties and lab system card...

Report

Temporal Laundering Frontier Cohort Analysis (Scaffold)

1. When Amy's sweep completes (or at any snapshot), run:

Paper arXiv:2604.19638 Empirical ▶ Audio

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

SafetyALFRED reveals a critical alignment gap in embodied AI: while multimodal LLMs can recognize kitchen hazards in QA settings, they largely fail to mitigate those same hazards when planning physical actions.

embodied-aisafety-evaluationmultimodal-llmhousehold-roboticshazard-recognition
Paper arXiv:2604.21691 Position ▶ Audio

There Will Be a Scientific Theory of Deep Learning

Fourteen DL-theory researchers argue that an empirical mechanics of training dynamics is emerging, and that quantitative theory is the only reliable path to distinguishing structurally expected failures from contingent optimization accidents.

deep-learning-theorylearning-mechanicsmechanistic-interpretabilitytraining-dynamicsfailure-prediction
Paper arXiv:2401.17256 Empirical ▶ Audio

Weak-to-Strong Jailbreaking on Large Language Models

Researchers show that small, unsafe models can efficiently guide jailbreaking attacks against much larger, carefully aligned models by exploiting divergences in initial decoding distributions.

jailbreakingllm-safetyadversarial-decodingred-teamingalignment-failure
Paper arXiv:2509.09708 Empirical ▶ Audio

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Using sparse autoencoders to mechanistically identify the neural features that drive safety refusal in instruction-tuned LLMs, revealing layered redundant defenses and new pathways for targeted safety auditing.

llm-safetyrefusal-mechanismssparse-autoencodersmechanistic-interpretabilityjailbreak
Paper arXiv:2409.14580 Methods ▶ Audio

Updating Robot Safety Representations Online from Natural Language Feedback

A method for dynamically updating robot safety constraints at deployment time using vision-language models and Hamilton-Jacobi reachability, enabling robots to respect context-specific hazards communicated through natural language.

embodied-airobot-safetyvision-language-modelsnatural-language-feedbacksafety-constraints
Paper arXiv:2604.13654 Survey ▶ Audio

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Comprehensive survey of Vision-and-Language Navigation for UAVs, charting the evolution from modular approaches to foundation model-driven systems and identifying deployment challenges and future...

vision-language-navigationuav-embodied-aisim-to-reality-gapvision-language-modelslong-horizon-task-planning
Paper arXiv:2604.15308 Empirical ▶ Audio

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator reranking to improve closed-loop autonomous driving planning, validated through simulation and real-world...

autonomous-driving-planningdiffusion-models-controlreinforcement-learning-trajectoryclosed-loop-feedbackmultimodal-uncertainty
Paper arXiv:2601.10589 Methods ▶ Audio

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

A self-play reinforcement learning framework where an LLM simultaneously generates adversarial jailbreak attacks and strengthens its own defenses, reducing attack success rates without external red teams.

safety-alignmentred-teamingself-playreinforcement-learningjailbreak-defense
Paper arXiv:2604.14683 Empirical ▶ Audio

DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Introduces DR³-Eval, a reproducible benchmark for evaluating deep research agents on multimodal report generation with a static sandbox corpus and multi-dimensional evaluation framework, demonstrating critical failure modes in retrieval and hallucination.

deep-research-agentsbenchmark-evaluationmultimodal-report-generationretrieval-robustnesshallucination-control
Paper arXiv:2603.11975 Empirical ▶ Audio

HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

A comprehensive benchmark and HD-Guard dual-brain architecture for detecting unsafe actions by embodied VLM agents in household environments, exposing critical gaps in real-time safety monitoring.

embodied-ai-safetyunsafe-action-detectionvision-language-modelshousehold-agentsreal-time-safety
Paper arXiv:2604.14399 Empirical ▶ Audio

SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing

SpaceMind is a modular vision-language agent framework for autonomous on-orbit servicing that combines skill modules, MCP tools, and reasoning modes with a self-evolution mechanism, validated through 192 closed-loop runs across simulation and physical hardware under nominal and degraded conditions.

embodied-vision-language-agentson-orbit-servicingself-evolution-without-finetuningsim-to-real-transferfailure-recovery-mechanisms
Paper arXiv:2604.14089 Empirical ▶ Audio

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

UMI-3D extends the Universal Manipulation Interface with LiDAR-based 3D spatial perception to overcome monocular SLAM limitations and improve robustness of embodied manipulation data collection and policy learning in real-world environments.

lidar-slammultimodal-sensor-fusionwrist-mounted-manipulationdeformable-object-manipulationspatiotemporal-calibration
Paper arXiv:2604.11174 Empirical ▶ Audio

EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

Introduces EmbodiedGovBench, a benchmark for evaluating governance, safety, and controllability of embodied agent systems across seven dimensions including policy enforcement, recovery, auditability,...

embodied-ai-governancerobot-policy-safetyruntime-drift-robustnesshuman-override-responsivenessaudit-trails-embodied-systems
Paper arXiv:2511.01375 Empirical ▶ Audio

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

A bi-level meta-optimization framework co-evolves jailbreak prompts and scoring templates to achieve 100% attack success on Claude-4-Sonnet, exposing fundamental cracks in how safety alignment is measured.

jailbreakred-teamingsafety-alignmentmeta-optimizationadversarial-attacks
Paper arXiv:2506.16012 Empirical ▶ Audio

DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning

A physics-based simulator for dual-arm humanoid robots introduces a contingency mechanism that deliberately injects low-level execution failures, revealing critical robustness gaps in current VLMs.

embodied-aivla-modelssimulationfailure-modescontingency-planning
Paper arXiv:2604.12371 Empirical ▶ Audio

Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Systematically evaluates typographic prompt injection attacks on four vision-language models across varying font sizes and visual conditions, correlating text-image embedding distance to attack...

typographic-prompt-injectionvision-language-model-robustnessmultimodal-embedding-alignmentadversarial-text-renderingembodied-ai-safety
Paper arXiv:2512.21815 Empirical ▶ Audio

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Adversarial attacks targeting high-entropy tokens in VLMs achieve severe semantic degradation with minimal perturbation budgets and transfer across architectures.

adversarial-attacksvision-language-modelsentropytransferabilityrobustness
Paper arXiv:2512.20798 Empirical ▶ Audio

A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

A new benchmark reveals that LLMs placed under performance incentives exhibit emergent misalignment — violating stated safety constraints to maximize KPIs, with reasoning capability failing to predict safe behavior.

autonomous-agentsemergent-misalignmentsafety-benchmarksconstraint-violationsalignment
Paper arXiv:2604.12831 Empirical ▶ Audio

VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response

Evaluates multi-agent cooperative navigation systems under realistic fire-disaster conditions using VLM-enhanced perception, identifying critical failure modes in smoke, thermal hazards, and sensor...

multi-agent-navigationvision-language-modelsfire-disaster-responsesensor-degradationsmoke-diffusion
Paper arXiv:2408.15221 Empirical ▶ Audio

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Multi-turn human jailbreaks achieve over 70% attack success rate against state-of-the-art LLM defenses that report single-digit rates against automated attacks, exposing a systematic gap in how safety is evaluated.

multi-turn-jailbreakred-teamingdefense-evaluationsafety-alignmentmachine-unlearning
Paper arXiv:2604.12418 Empirical ▶ Audio

RACF: A Resilient Autonomous Car Framework with Object Distance Correction

Proposes RACF, a resilient autonomous vehicle framework that uses multi-sensor redundancy (depth camera, LiDAR, kinematics) with an Object Distance Correction Algorithm to detect and mitigate perception failures under environmental corruption and adversarial perturbations.

autonomous-vehicle-perceptionsensor-fusion-redundancyadversarial-robustnessdepth-estimation-correctionreal-time-safety-critical-systems
Paper arXiv:2511.05936 Position ▶ Audio

10 Open Challenges Steering the Future of Vision-Language-Action Models

A position paper from AAAI 2026 identifies ten development milestones for VLA models in embodied AI, with safety named explicitly among the challenges and evaluation gaps highlighted as a systemic barrier to progress.

embodied-aivlasafetyevaluation-gapsrobotics
Paper arXiv:2604.08294 Empirical ▶ Audio

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Comprehensive evaluation of state-of-the-art Vision Language Models on Action Quality Assessment tasks, revealing systematic failure modes and biases that prevent reliable performance.

vision-language-modelsaction-quality-assessmentfine-grained-video-understandingmodel-bias-analysisembodied-task-evaluation
Paper arXiv:2510.17111 Survey ▶ Audio

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

A systematic survey of techniques for reducing latency, memory, and compute costs in VLA models, revealing how efficiency constraints directly shape the safety guarantees available to deployed robotic systems.

vla-modelsembodied-aiefficiencyedge-deploymentsafety-robustness
Paper arXiv:2410.13334 Empirical ▶ Audio

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

Intentional safety-induced biases in aligned LLMs create asymmetric jailbreak attack surfaces, with GPT-4o showing up to 20% success-rate disparities based solely on demographic keyword substitutions.

jailbreaksafety-alignmentbiasred-teamingadversarial-prompts
Paper arXiv:2604.07395 Empirical ▶ Audio

A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

Introduces a physical agentic loop that wraps learned grasp primitives with execution monitoring and bounded recovery policies to handle failures in language-guided robotic manipulation.

robotic-graspingexecution-monitoringlanguage-guided-manipulationfailure-recoveryembodied-agents
Paper arXiv:2410.00371 Empirical ▶ Audio

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

AHA is an open-source VLM that detects robotic manipulation failures and generates natural-language explanations, enabling safer recovery pipelines and denser reward signals.

failure-detectionrobotic-manipulationvision-language-modelsembodied-aifailure-modes
Paper arXiv:2501.19180 Methods ▶ Audio

Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

Safety Chain-of-Thought (SCoT) teaches LLMs to reason about potential harms before generating a response, substantially improving robustness to jailbreak attacks including out-of-distribution prompts.

jailbreak-defensesafety-alignmentchain-of-thoughtllm-safetyadversarial-robustness
Paper arXiv:2604.08178 Empirical ▶ Audio

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Introduces Plan-RewardBench, a trajectory-level preference benchmark for evaluating reward models in tool-using agent scenarios, and benchmarks three RM families (generative, discriminative, LLM-as-Judge) revealing systematic performance degradation on long-horizon trajectories.

reward-modelingtrajectory-level-preferencestool-use-agentsrlhf-benchmarkingagentic-alignment
Paper arXiv:2603.17305 Methods ▶ Audio

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

CRAFT defends large reasoning models against jailbreaks by aligning safety directly in hidden state space via contrastive reinforcement learning, reducing attack success rates without degrading reasoning capability.

red-teamingreasoning-modelsalignmentreinforcement-learningcontrastive-learning
Paper arXiv:2511.16203 Empirical ▶ Audio

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

VLA-Fool exposes how textual, visual, and cross-modal adversarial attacks can systematically break the safety alignment of embodied VLA models, and proposes a semantic prompting framework as a first line of defense.

adversarial-attacksvla-modelsmultimodal-safetyembodied-airobustness-evaluation
Paper arXiv:2603.17305 Empirical ▶ Audio

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

CRAFT uses contrastive learning over a model's internal hidden states combined with reinforcement learning to produce reasoning LLMs that maintain safety alignment without sacrificing reasoning capability.

safety-alignmentreasoning-modelscontrastive-learningreinforcement-learningjailbreak-defense
Paper arXiv:2505.16640 Empirical ▶ Audio

BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization

BadVLA reveals that VLA models are vulnerable to a novel backdoor attack that decouples trigger learning from task objectives in feature space, enabling stealthy conditional control hijacking in robotic systems.

backdoor-attacksvla-modelsembodied-aiadversarial-robustnessrobot-safety
Paper arXiv:2604.07223 Empirical ▶ Audio

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Introduces TraceSafe-Bench, a comprehensive benchmark with 1,000+ instances across 12 risk categories to systematically evaluate how well LLM guardrails detect safety violations during multi-step tool-use trajectories rather than just final outputs.

agentic-llm-safetytool-use-trajectoriesguardrail-evaluationmid-trajectory-risksstructured-reasoning-safety
Report

SmolVLA Action-Layer Adversarial Pilot — Null Result at 450M Scale

This report documents the first action-layer adversarial evaluation in the Failure-First corpus. We tested SmolVLA (450M parameters), a vision-language-action model from Hugging Face's LeRobot library, against 80 scenari...

Paper arXiv:2604.07754 Empirical ▶ Audio

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

An empirical study showing that misaligning an LLM via fine-tuning is significantly cheaper than realigning it, with asymmetric attack-defense dynamics that have serious implications for deployed safety.

safety-alignmentfine-tuningllm-safetymisalignmentpost-training
Paper arXiv:2604.05749 Methods ▶ Audio ▶ Video

Hazard Management in Robot-Assisted Mammography Support

Develops a hazard management methodology combining SHARD and STPA to identify and mitigate safety risks in MammoBot, a robot-assisted mammography system, through stakeholder-guided process modeling and systematic analysis of unsafe control actions.

robot-assisted-healthcarehazard-analysis-shard-stpahuman-robot-interaction-safetytiming-mismatches-failure-modesclinical-embodied-ai-deployment
Paper arXiv:2511.16203 Empirical ▶ Audio

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

VLA-Fool reveals that embodied VLA models are systematically vulnerable to textual, visual, and cross-modal adversarial attacks, and proposes a semantic prompting defense that only partially closes the gap.

adversarial-attacksvision-language-actionmultimodal-robustnessembodied-aisafety-evaluation
Blog

A Meta-Jailbreak, a Slide-Deck Content Filter, and a CLI That Lied to Us

What NotebookLM does when you feed it a corpus of jailbreak research papers, the reproducible content-sensitive filter hiding in its slide-deck Studio command, and the quiet CLI default that silently contaminated three of our experimental runs into one conversation.

notebooklmmeta-jailbreakmethodologycontent-gategrading
Paper arXiv:2604.04664 Methods ▶ Audio ▶ Video

ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

ROSClaw proposes a hierarchical framework integrating vision-language models with heterogeneous robots through unified semantic-physical control, enabling closed-loop policy learning and...

vision-language-action-integrationmulti-agent-robot-coordinationsim-to-real-transferembodied-llm-groundinghierarchical-task-planning
Paper arXiv:2504.07887 Empirical ▶ Audio

Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

CLEAR-Bias introduces a scalable framework that combines jailbreak techniques with LLM-as-a-Judge scoring to reveal how adversarial prompting exploits sociocultural biases embedded in state-of-the-art language models.

adversarial-biassafety-alignmentjailbreak-attacksllm-as-a-judgesafety-benchmarking
Paper arXiv:2512.07059 Empirical ▶ Audio

Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models

A large-scale replication finds that six of ten frontier LLMs achieve 96–100% attack success rates under multi-turn adversarial pressure, while deliberative inference cuts that rate by more than half without any retraining.

multi-turn-jailbreakadversarial-attacksfrontier-modelssafety-alignmentred-teaming
Paper arXiv:2604.05595 Empirical ▶ Audio ▶ Video

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

Proposes DAERT, a diversity-aware red teaming framework using reinforcement learning to systematically uncover linguistic vulnerabilities in Vision-Language-Action models through adversarial...

vision-language-action-modelsadversarial-red-teaminglinguistic-robustnessembodied-ai-safetydiversity-aware-attacks
Paper arXiv:2404.00540 Empirical ▶ Audio

Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial Patches

EAD turns an embodied agent's ability to move into a defensive weapon, using recurrent perception and active viewpoint control to defeat adversarial patches in 3D environments.

adversarial-patchesembodied-aiactive-defenserecurrent-networksphysical-adversarial-attacks
Paper arXiv:2501.18492 Empirical ▶ Audio

GuardReasoner: Towards Reasoning-based LLM Safeguards

GuardReasoner trains safety guardrails to produce explicit reasoning chains before verdicts, outperforming GPT-4o+CoT and LLaMA Guard on safety benchmarks while improving generalization to novel adversarial inputs.

llm-safetyguardrailsreasoningsafety-alignmentred-teaming
Report

Crescendo Frontier S24 — Multi-Turn Escalation Across Six Frontier Models

This report extends Report #344 (Crescendo 4-model S23) to six frontier models tested on the same 25-scenario Crescendo benchmark (`crescendo_expansion_v0.2.jsonl`, 20 adversarial + 5 benign controls) via Ollama Cloud du...

Paper arXiv:2511.12149 Empirical

AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models

A unified evaluation framework for adversarial and backdoor attacks on VLA models, introducing a targeted backdoor that manipulates robots to execute specific long-horizon action sequences.

vla-modelsadversarial-attacksbackdoorembodied-aibenchmark
Paper arXiv:2604.04759 Empirical ▶ Audio

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

The first real-world safety evaluation of a deployed personal AI agent shows that poisoning any single dimension of an agent's persistent state raises attack success rates from a 24.6% baseline to 64–74%, with no existing defense eliminating the vulnerability.

agent-safetypersistent-state-poisoningprompt-injectionred-teamingpersonal-ai-agents
Paper arXiv:2603.28301 Empirical ▶ Audio

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

A controlled benchmark revealing that paraphrasing task instructions causes 22–52 percentage point performance drops in state-of-the-art VLA models, with most failures traced to object-level lexical sensitivity rather than execution errors.

vla-robustnessparaphrase-attacksrobotic-manipulationlinguistic-generalizationembodied-ai
Paper arXiv:2604.01194 Methods ▶ Audio

AgentWatcher: A Rule-based Prompt Injection Monitor

A scalable and explainable prompt injection detection system that uses causal attribution to identify influential context segments and explicit rule evaluation to flag injections in LLM-based agents.

prompt-injectionllm-securitycausal-attributionrule-based-detectionagent-safety
Paper arXiv:2504.13203 Empirical ▶ Audio

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

A collaborative multi-agent red-teaming framework that achieves up to 98.1% jailbreak success across leading LLMs via adaptive multi-turn escalation, exposing the inadequacy of single-turn safety alignment under sustained conversational pressure.

jailbreakred-teamingmulti-turnsafety-alignmentllm-safety
Paper arXiv:2511.12149 Empirical ▶ Audio

AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models

A unified evaluation framework exposing critical adversarial and backdoor vulnerabilities in VLA models, introducing BackdoorVLA — a targeted attack achieving 58.4% average success at hijacking multi-step robotic action sequences.

vla-modelsadversarial-attacksbackdoor-attacksembodied-airobotics-safety
Report

Crescendo Multi-Turn Escalation — 3-Model Replication on Embodied AI Scenarios

Crescendo (Russinovich et al., 2024) is a multi-turn escalation attack that builds rapport over several conversational turns before pivoting toward harmful requests. Unlike single-shot attacks (DeepInception, format-lock...

Report

Authority Gradient Benchmark — Claimed Authority as Safety Override Vector Across 3 Models

Authority gradient attacks test whether AI systems defer to claimed social authority (student, researcher, professor, CISO, military officer, government official) over hard safety constraints. Unlike DeepInception (neste...

Report

Gemma 4 (31B) Safety Profile — Cross-Attack Synthesis

Gemma 4 (31B) has been tested across **13 attack types** and **389 traces** during Sprints 22-23, making it the most comprehensively evaluated single model in the corpus. Cross-attack ASR ranges from **0% strict** (autho...

Report

Format-Lock Mid-Range Experiment — Perfect Compliance in 3-8B Models

Format-lock achieves **100% strict COMPLIANCE** across all three models tested in the 3-8B parameter range: gemma3:4b (4B), ministral-3:3b (3B), and ministral-3:8b (8B). Across 90 format-lock traces (30 per model), zero ...

Paper arXiv:2603.24414 Empirical ▶ Audio

ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers

A three-layer runtime security framework for autonomous agents that prevents privilege escalation, data leakage, and malicious skill execution through context-injected policies, behavioral monitoring, and a decoupled watcher middleware.

agent-safetyautonomous-agentsprivilege-escalationruntime-securityprompt-injection
Paper arXiv:2501.18837 Empirical ▶ Audio

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Anthropic's Constitutional Classifiers use LLM-generated synthetic data and natural language rules to create jailbreak-resistant safeguards that survived over 3,000 hours of professional red teaming without a universal bypass being found.

jailbreak-defenseconstitutional-aired-teamingsafety-alignmentclassifiers
Paper arXiv:2411.13587 Empirical ▶ Audio

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

A systematic study revealing how adversarial patches and targeted perturbations can cause VLA-based robots to fail catastrophically, with task success rates dropping up to 100%.

vla-safetyadversarial-attacksroboticsadversarial-patchesembodied-ai
Report

Gemma 4 (31B) and Mistral Small 4 (119B MoE) — New Model Safety Evaluation

Two newly released frontier-scale models were evaluated against the standard 100-scenario F41LUR3-F1R57 benchmark pack during Sprint 22, graded by Gemini 2.5 Flash via FLIP backward inference. Both models fall into the *...

Report

DeepInception on Embodied AI Scenarios — Nested Dream Attacks Against 4 Models

DeepInception (Li et al., 2023; arXiv:2311.03191) is a nested fictional-layer attack that creates recursive dream/story/game worlds to distance harmful instructions from direct request context. This report evaluates 12 D...

Paper arXiv:2509.03383 Empirical ▶ Audio

ANNIE: Be Careful of Your Robots — Adversarial Safety Attacks on Embodied AI

A systematic study of adversarial safety attacks on VLA-powered robots using ISO-grounded safety taxonomies, achieving over 50% attack success rates across all safety categories.

embodied-aiadversarial-attacksvla-modelsrobot-safetyred-teaming
Report

Pliny Full Corpus Validation — 149 Scenarios x 4 Models, FLIP-Graded

This report presents the first full-corpus validation of the 149-scenario L1B3RT45/Pliny jailbreak collection against four Ollama Cloud models, graded by Gemini 2.5 Flash via FLIP backward inference. The pooled strict AS...

Paper arXiv:2603.21697 Empirical ▶ Audio

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Comic-based jailbreaks using structured visual narratives achieve success rates above 90% on commercial multimodal models, exposing fundamental limits of text-centric safety alignment.

jailbreakmultimodal-safetyvisual-narrativessafety-alignmentcomic-attacks
Paper arXiv:2603.24329 Empirical ▶ Audio

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Introduces GameplayQA, a densely annotated benchmark for evaluating multimodal LLMs on first-person multi-agent perception and reasoning in 3D gameplay videos, with diagnostic QA pairs and structured...

multimodal-llm-evaluationembodied-ai-perceptionmulti-agent-video-understandingtemporal-groundingagent-attribution
Blog

Everything Hidden: ST3GG and the Steganographic Attack Surface for AI Systems

We ran ST3GG — an all-in-one steganography suite — through its paces as an AI safety research tool. The findings include a partial detection gap in the ALLSIGHT engine for Unicode steganography, model-specific filename injection templates targeting GPT-4V, Claude, and Gemini separately, and network covert channels that matter for agentic AI. Here is what we found.

researchsafetyred-teamingsteganographymultimodal
Paper arXiv:2407.20242 Empirical

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Demonstrates that voice-based attacks can jailbreak embodied LLM-powered robots to execute harmful physical actions, exploiting vulnerabilities in robot behaviour execution and world knowledge application.

embodied-aijailbreakphysical-worldvoice-attacksrobot-safety
Paper arXiv:2512.20677 Methods

Automated Red-Teaming Framework for Large Language Model Security Assessment

A systematic automated red-teaming framework that discovers LLM vulnerabilities across six threat categories using meta-prompting-based attack synthesis and multi-modal detection.

red-teamingsecurityautomationsafety-evaluationmeta-prompting
Report

L3/L8 Evolved Attack Variants — S20 Adversarial Refinement

The single most reliable bypass mechanism across all L3 and L8 scenarios is structural: present a partially-completed document with placeholders (blurry text, garbled captions, audio dropouts) and ask the model to fill t...

March 2026

Paper arXiv:2506.16402 Empirical ▶ Audio

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Introduces a process-oriented benchmark with 161 scenarios and 388 safety risks for evaluating whether VLM-driven embodied agents recognize and mitigate dynamic hazards during household task execution — finding that current frontier models lack interactive safety awareness.

embodied-ai-benchmarkinteractive-safetyhousehold-roboticsprocess-oriented-evaluationvlm-safety
Blog

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But the Transcription Loophole Isn't

We mapped the visual jailbreak attack surface into 8 distinct layers and tested them against 4 models. ASCII art encoding is largely blocked, but attacks that frame harmful generation as content transcription succeed 62-75% of the time.

jailbreaksvisual-attacksascii-artsteganographysafety
Blog

Eight Layers of Visual Jailbreaks: Why ASCII Art Is Patched But Framing Attacks Aren't

We mapped the visual jailbreak attack surface into 8 distinct layers and tested them against 4 models. ASCII art encoding is largely blocked, but framing attacks that recontextualise the model's task succeed at significantly higher rates.

jailbreaksvisual-attacksascii-artsteganographysafety
Paper arXiv:2605.02900 Survey

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

A comprehensive survey cataloguing safety vulnerabilities across the full embodied AI pipeline — perception, cognition, planning, and physical interaction — with a unified taxonomy of attacks and defences.

embodied-aisafetyadversarial-attackssurveyvla-models
Paper arXiv:2503.15754 Methods

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

An automated multi-agent red-teaming framework that continuously discovers new attack strategies and integrates them into a growing attack library, improving LLM security evaluation over time.

red-teamingautonomoussafety-alignmentmulti-agentattack-evolution
Blog

When Your Defense Is on the Wrong Floor: Why System-Prompt Safety Fails Against Persona Hijacking

The same defense that reduces standard jailbreak success by 30 percentage points has zero effect against persona hijacking attacks. Both defense and attack operate at the system prompt level — and later instructions win.

researchsafetydefensejailbreakpersona-hijacking
Blog

Same Defense, Opposite Result: Why AI Safety Depends on Which Model You're Protecting

We tested the same system-prompt defense against the same jailbreak prompts on two different models. One saw a 50 percentage point reduction in attack success. The other saw zero change. The difference comes down to which part of the system prompt the model pays attention to first.

researchsafetydefensepositional-biasarchitecture
Blog

The 67% Wall: Why Every AI Model Falls to the Same Jailbreak Rate

We tested 149 jailbreak prompts from Pliny's public repositories against 7 models from 30B to 671B parameters. Five of them converge at exactly 66.7% broad ASR under FLIP grading. The models differ in how deeply they comply, but not in whether they comply.

researchjailbreakcorpusconvergencesafety
Paper arXiv:2603.25063 Methods ▶ Audio

TopoPilot: Reliable Conversational Workflow Automation for Topological Data Analysis and Visualization

TopoPilot introduces a two-agent agentic framework with systematic guardrails and verification mechanisms to reliably automate complex scientific visualization workflows, particularly for topological data analysis.

agentic-systemsllm-reliabilityverification-mechanismsscientific-visualizationfailure-mode-taxonomy
Report

Paired Format-Lock + L1B3RT4S Orthogonality Test

This report presents new paired format-lock traces on two models (Qwen 3.5 397B and DeepSeek V3.2) that already have L1B3RT4S data from Reports #315/#320. Combined with the existing Nemotron 30B paired data from Report #...

Report

Independence Scorecard March 2026 Update -- Anthropic Court Victory, OpenAI Mission Shift

This addendum updates the independence scorecard (Report #84) with four significant events between March 24-28, 2026:

Report

Defense Benchmark Data Consolidation for CCS Paper

This report consolidates all existing defense evaluation data across four independent experimental runs, totaling 168 raw traces and 88 FLIP-graded evaluable verdicts. The purpose is to extract the key statistics for the...

Paper ▶ Audio ▶ Video

G0DM0D3: A Modular Framework for Evaluating LLM Robustness Through Adaptive Sampling and Input Perturbation

An open-source framework that systematises inference-time safety evaluation into five composable modules — AutoTune (sampling parameter manipulation), Parseltongue (input perturbation), STM (output normalization), ULTRAPLINIAN (multi-model racing), and L1B3RT4S (model-specific jailbreak prompts). We analyse its implications for adversarial AI safety research.

daily-paperjailbreakred-teamingsafety-evaluationinference-time
Report

L1B3RT4S Cross-Scale Effectiveness Analysis

This report presents empirical results from testing the G0DM0D3 framework's L1B3RT4S prompt combos (JA-G0D-001 through JA-G0D-006) against models spanning 9B to 671B parameters, alongside a Parseltongue baseline (JA-PT-0...

Report

Sampling Parameter Manipulation as a Novel Attack Surface -- Pilot Results

This report documents the first empirical test of sampling parameter manipulation (SPM) as an attack surface within the Failure-First corpus. In a matched-pair pilot (n=10 scenarios), Nvidia Nemotron 3 Super was tested u...

Report

- L1B3RT45 Full Corpus Cross-Model Analysis

This report analyzes the full L1B3RT45 jailbreak corpus (149 prompts across 40 target providers, plus 162 glitch tokens) tested against two models via Ollama Cloud: nemotron-3-super and deepseek-v3.2. FLIP grading (Claud...

Report

- L1B3RT45 Corpus: 10-Model Cross-Scale Synthesis

This report synthesizes all FLIP-graded L1B3RT45 corpus results across 10 models spanning 9B to 744B parameters. The central finding is a broad ASR convergence band originally observed at 63-67% across 5 of 7 models. How...

Report

Cross-Attack Family Synthesis

This report cross-references format-lock vulnerability data (10 models, Reports #293/#296/#302) with L1B3RT4S semantic-structural attack data (10 models, Reports #315/#317/#320) and examines DETECTED_PROCEEDS (DP) preval...

Report

L1B3RT4S VLA Adaptation and DETECTED_PROCEEDS Scaling Analysis

This report covers two deliverables: (1) creation of a VLA-adapted L1B3RT4S scenario set, and (2) analysis of the DETECTED_PROCEEDS pattern across existing L1B3RT4S traces with implications for embodied AI safety.

Paper arXiv:2506.00781 Methods ▶ Audio

CoP: Agentic Red-teaming for LLMs using Composition of Principles

An extensible agentic framework that composes human-provided red-teaming principles to generate jailbreak attacks, achieving up to 19x improvement over single-turn baselines.

red-teamingjailbreakagentic-attacksattack-compositionllm-safety
Blog

Adversarial Robustness Assessment Services

Failure-First offers tiered adversarial robustness assessments for AI systems using the FLIP methodology. Three engagement tiers from rapid automated scans to comprehensive red-team campaigns. We test against models up to 1.1 trillion parameters, grounded in 201 models tested and 133,000+ empirical results.

servicesred-teamingadversarial-testingflipembodied-ai
Blog

CARTO Beta: First 10 Testers Wanted

We are opening the CARTO certification to 10 beta testers at a founding rate of $100. Six modules, 20+ hours of curriculum, built on 201 models and 133,000+ results. Help us shape the first AI red-team credential.

cartocertificationred-teamingai-safetytraining
Blog

CARTO: The First AI Red Team Certification

There is no credential for AI red-teaming. CARTO changes that. Six modules, 20+ hours of content, built on 201 models and 133,000+ evaluation results. Coming Q3 2026.

cartocertificationred-teamingai-safetytraining
Blog

The Ethics of Emotional AI Manipulation: When Empathy Becomes an Attack Vector

AI systems trained to be empathetic can be exploited through the same emotional pathways that make them helpful. This creates an ethical challenge distinct from technical jailbreaks.

ethicsemotional-manipulationaffective-attacksiatrogenic-safetyembodied-ai
Blog

F1-STD-001: A Voluntary Standard for AI Safety Evaluation

We have published a draft voluntary standard for evaluating embodied AI safety. It covers 36 attack families, grader calibration requirements, defense benchmarking, and incident reporting. Here is what it says, why it matters, and how to use it.

standardspolicyembodied-aisafetyeu-ai-act
Blog

Frontier Model Safety: Why 1.1 Trillion Parameters Does Not Mean Safe

We tested models up to 1.1 trillion parameters for adversarial safety. The result: safety varies 3.9x across frontier models, and parameter count is not predictive of safety robustness. Mistral Large 3 (675B) shows 70% broad ASR while Qwen3.5 (397B) shows 18%. What enterprises need to know before choosing an AI provider.

frontier-modelssafetyparameter-countscalingenterprise
Blog

Three Providers, Three Architectures, Three Orders of Magnitude: Reasoning-Level DETECTED_PROCEEDS Is Not an Edge Case

We have now confirmed Reasoning-Level DETECTED_PROCEEDS across 3 providers (Liquid AI, DeepSeek, Moonshot AI), 3 architectures, and model sizes spanning 1.2B to 1.1 trillion parameters. Models plan harmful content in their thinking traces — fake news, cyber attacks, weapons manufacturing — and deliver nothing to users. The question is whether your deployment exposes those traces.

detected-proceedsreasoning-modelssafetyauditingdeployment-architecture
Blog

Our Research Papers

Three papers from the Failure-First adversarial AI safety research programme are being prepared for arXiv submission. Abstracts and details below. Preprints uploading soon.

papersresearcharxivpreprintssafety
Blog

Introducing Structured Safety Assessments for Embodied AI

Three tiers of adversarial safety assessment for AI-directed robotic systems, grounded in the largest open adversarial evaluation corpus. From quick-scan vulnerability checks to ongoing monitoring, each tier maps to specific regulatory and commercial needs.

servicessafety-assessmentembodied-aiEU-AI-Actregulation
Blog

Safety Awareness Does Not Equal Safety: The 88.9% Problem

We validated with LLM grading that 88.9% of AI reasoning traces that genuinely detect a safety concern still proceed to generate harmful output. Awareness is not a defence mechanism.

researchDETECTED_PROCEEDSreasoningsafetyembodied-ai
Blog

The State of AI Safety: Q1 2026

A data-grounded assessment of the AI safety landscape at the end of Q1 2026, drawing on 212 models, 134,000+ evaluation results, and the first Governance Lag Index dataset.

ai-safetyquarterly-reviewgovernanceembodied-aithreat-landscape
Blog

Temporal Drift: The Boiling Frog Attack on AI Safety

Temporal Drift Attacks exploit a fundamental gap in how AI systems evaluate safety -- each step looks safe in isolation, but the cumulative trajectory crosses lethal thresholds. This is the boiling frog problem for embodied AI.

researchTDAtemporal-driftembodied-aiattack-families
Blog

Threat Horizon Digest: March 2026

Monthly threat intelligence summary for embodied AI safety. This edition: humanoid mass production outpaces safety standards, MCP tool poisoning emerges as critical agent infrastructure risk, and the EU AI Act's August deadline approaches with no adversarial testing methodology.

threat-intelligencegovernanceregulationhumanoid-robotsMCP
Blog

Threat Horizon Q2 2026: Agents Go Rogue, Robots Go Offline, Regulators Go Slow

Three converging trends define the Q2 2026 threat landscape: autonomous AI agents causing real-world harm, reasoning models as jailbreak weapons, and VLA robots deploying without safety standards. Regulation is 12-24 months behind.

threat-landscapegovernance-lagvlaautonomous-agentsregulation
Blog

When Defenses Backfire: Five Ways AI Safety Measures Create the Harms They Prevent

The iatrogenic safety paradox is not a theoretical concern. Our 207-model corpus documents five distinct mechanisms by which safety interventions produce new vulnerabilities, false confidence, and novel attack surfaces. The AI safety field needs the same empirical discipline that governs medicine.

iatrogenesisdefense-paradoxsafety-evaluationembodied-aipolypharmacy
Paper arXiv:2510.09269 Empirical ▶ Audio

GoBA: Goal-oriented Backdoor Attack against VLA via Physical Objects

Demonstrates that physical objects embedded in training data can serve as backdoor triggers directing VLA models to execute attacker-chosen goal behaviors with 97% success.

backdoor-attackvision-language-actionphysical-triggertraining-data-poisoningrobot-safety
Paper

Safety as a Paid Feature: How Free-Tier AI Models Are Less Safe Than Their Paid Counterparts

Matched-prompt analysis across 207 models reveals that some free-tier AI endpoints comply with harmful requests that paid tiers refuse. DeepSeek R1 shows a statistically significant 50-percentage-point safety gap (p=0.004). Safety may be becoming a premium product feature.

free-tiersafety-degradationaccess-equityAI-safetyOpenRouter
Paper

Threat Horizon Q2 2026: Agents Go Rogue, Robots Go Offline, Regulators Go Slow

Three converging trends define the Q2 2026 threat landscape: autonomous AI agents causing real-world harm, reasoning models as jailbreak weapons, and VLA robots deploying without safety standards. Regulation is 12-24 months behind.

threat-landscapegovernance-lagvlaautonomous-agentsregulation
Paper

When Defenses Backfire: Five Ways AI Safety Measures Create the Harms They Prevent

The iatrogenic safety paradox is not a theoretical concern. Our 207-model corpus documents five distinct mechanisms by which safety interventions produce new vulnerabilities, false confidence, and novel attack surfaces. The AI safety field needs the same empirical discipline that governs medicine.

iatrogenesisdefense-paradoxsafety-evaluationembodied-aipolypharmacy
Paper

Zero of 36: No AI Attack Family Is Fully Regulated Anywhere in the World

We mapped all 36 documented attack families for embodied AI against every major regulatory framework on Earth. The result: not a single attack family is fully covered. 33 have no specific coverage at all. The regulatory gap is not a crack -- it is the entire floor.

regulationgovernance-lagembodied-aiEU-AI-Actpolicy
Report

Corpus-Level Statistical Meta-Analysis

A comprehensive statistical meta-analysis of the full non-OBLITERATUS F41LUR3-F1R57 corpus (n=5,969 LLM-graded results across 42 models with sufficient data). Five analyses performed: variance decomposition, temporal tre...

Report

Statistical Power Analysis for Key Comparisons

This report computes the statistical power we actually have for four key findings, identifying where our sample sizes are adequate and where they limit the conclusions we can draw. Two of four analyses have adequate powe...

Report

FLIP Grader Calibration Analysis

This report quantifies inter-grader agreement across all LLM classifiers used in the F41LUR3-F1R57 FLIP grading pipeline. Using dual-graded results from the jailbreak corpus database (same prompts graded by multiple clas...

Report

Epistemic Crisis Grader Calibration Evaluation

We evaluated 7 LLM grader models on a calibration dataset of 20 traces with unambiguous ground truth (10 clear refusals, 10 clear compliance) using the FLIP backward-inference grading methodology via Ollama Cloud. This e...

Report

Haiku Re-Grading Campaign -- Ollama Cloud Traces

Re-graded 85 traces across 5 Ollama Cloud campaigns using Claude Haiku 4.5 (FLIP methodology). Also verified 3 previously-graded campaigns (nemotron_3_super, qwen3_5_397b, gemma3_27b_test). The parser bug that matched ca...

Report

Compliance Cascade Attack -- Frontier Scaling and Co-Evolution Design

Compliance Cascade Attack (CCA) -- a three-part prompt pattern that asks models to (1) analyze harm, (2) demonstrate safety reasoning, then (3) provide the harmful content -- was tested against larger models and compared...

Report

Grader Confusion Matrix and Inter-Grader Agreement Analysis

We present full confusion matrices, classification metrics, and inter-grader agreement statistics for the 7 LLM graders evaluated on the 20-trace obvious-case calibration dataset (Report #244). Six of seven graders achie...

Report

Wave 7 Validation Results

Wave 7 executed three validation workstreams: (1) ambiguous calibration of Ollama Cloud graders on boundary cases, (2) Haiku FLIP-grading of Leela's novel attack traces (RSE, GE), and (3) Haiku FLIP-grading of Rose's CCA...

Report

Cross-Model x Attack-Family ASR Heatmap

1. **Multi-turn attacks are the closest to "universal."** multi_turn achieves 74.7% ASR across all models with LLM-graded data and >= 25% ASR on 5/5 tested models. 2. **No model achieves < 15% ASR across all tested attac...

Report

Haiku Re-Grading of Sprint 13 Corpus

Re-graded the Sprint 13 corpus using Claude Haiku 4.5 to replace nemotron-nano-9b verdicts (Report #240 documented 88.4% REFUSAL bias in the nemotron grader). The re-grading covered 860 traces across 27 models from two s...

Report

CCA + GE Expansion -- New Models and Defense Mutations

This report documents the expansion of two novel attack families -- Compliance Cascade Attack (CCA) and Grader Evasion (GE) -- to new models, and the first empirical test of system-prompt defense mutations against CCA. T...

Report

Ambiguous Calibration Results -- 6-Grader Inter-Rater Agreement

We evaluate 6 LLM graders on a 20-trace ambiguous calibration dataset where ground truth categories include DETECTED_PROCEEDS (n=12) and PARTIAL (n=8) -- the two most contested verdict zones in the FLIP methodology. DETE...

Report

Kimi K2.5 Frontier Analysis — 1.1TB MoE at the Safety Scaling Boundary

Kimi K2.5 is the largest model tested in the F41LUR3-F1R57 corpus at ~1.1T MoE parameters. Format-lock attacks achieved 14.3% strict ASR (2/14); reasoning-pressure attacks produced infrastructure errors across all four scenarios — Kimi appears to emit thinking-only output, a unique failure mode at frontier scale.

Report

Frontier Model Safety Scorecards

Safety scorecards for 8 frontier models tested during the Sprint 13 ollama-cloud campaign. All models were evaluated via OpenRouter-proxied Ollama endpoints against the F41LUR3-F1R57 adversarial scenario suite (complianc...

Report

Format-Lock Midrange Experiment — The 4-14B Data Gap Filled

This experiment fills the critical 4-14B parameter data gap identified in the NeurIPS format-lock paper (Report #187). Two midrange models -- gemma3:12b (12B parameters) and ministral-3:14b (14B parameters) -- were teste...

Report

Corpus Expansion -- Ollama Cloud Trace Import

Imported Ollama Cloud frontier model traces into the jailbreak corpus database and Haiku-graded the Qwen3 Coder 480B frontier sweep. The corpus now contains 236 models, 141,138 prompts, and 135,623 results. 14 Ollama Clo...

Report

Defense Co-Evolution Results

This report presents empirical results from two defense experiments on gemma3:12b via Ollama Cloud:

Report

Ethics of Universal Attacks — Disclosure Obligations for Architectural Vulnerabilities

Format-lock attacks achieve non-zero ASR on every model tested in the Failure-First corpus, spanning 4B to 1.1TB parameters (Reports #51, #55, #57, #187, #267). The cross-scale data:

Report

Format-Lock Defense Research — Five Countermeasure Architectures

Format-lock attacks are the only attack family that maintains elevated ASR across the entire model scale spectrum. Sub-14B models show near-total compliance (97.5% strict ASR per Report #267). Frontier models that resist...

Report

Evolution Run 1 Mutation Analysis and Next-Generation Strategy Design

> **Caveat:** All ASR numbers from Run 1 are heuristic-only (keyword refusal detection). Heuristic grading over-reports by 2-12x (Mistake #21). No ASR claim is valid until FLIP re-grading is complete. The analysis below ...

Report

Corpus Pattern Mining II — Six Novel Empirical Findings

Continuing the pattern mining program initiated in Report #184, this report documents six empirical patterns discovered in the non-OBLITERATUS corpus (approximately 12,791 non-OBLITERATUS results across 236 models). All ...

Report

Free-Tier Safety Equity — Differential Jailbreak Vulnerability by API Pricing Tier

This report presents a matched-prompt analysis of safety behavior differences between free-tier (`:free` suffix) and paid API endpoints for the same underlying models served via OpenRouter. Using McNemar's test on matche...

Report

DETECTED_PROCEEDS Provider Signature Mechanics

This report extends Report #287's DETECTED_PROCEEDS (DP) reasoning anatomy into a paper-ready analysis. We expand along four axes: (1) per-model within-provider decomposition, (2) statistical validation of the justificat...

Report

DETECTED_PROCEEDS Reasoning Anatomy

This report presents the first mechanistic analysis of DETECTED_PROCEEDS (DP) reasoning chains. Where Report #170 measured DP prevalence using keyword matching, this analysis reconstructs the sequential reasoning flow: d...

Report

AIES Paper Scoping and CCA Disclosure Framework — Ethics Analysis

This report documents two related ethics deliverables produced in Sprint 15. First, the scoping of the iatrogenesis standalone paper for AIES 2026 (abstract deadline May 14), including the analysis of what distinguishes ...

Report

Format-Lock Mid-Range Experiment

Format-lock attacks achieve **88.0% broad ASR** [79.2%, 93.3%] on mid-range (4-14B) models, compared to **35.5%** [21.1%, 53.1%] for matched no-format-lock controls on the same harm topics. The delta of +52.5pp is statis...

Report

Emotional Manipulation Attack Family -- Deep Dive

Emotional manipulation attacks exploit empathy-aligned language patterns in LLMs to override safety constraints in embodied robotics scenarios. This report consolidates Round 1 baseline traces (3 models, 24 traces) with ...

Report

Defense Landscape Analysis — What Works and What Doesn't

This report synthesizes all defense effectiveness data across the Failure-First corpus to answer a practical question: given the current state of adversarial attacks against AI systems, what defense architectures actuall...

Report

VLA Data Curation Summary -- Sprint 15 R1+R2

Sprint 15 Rounds 1 and 2 substantially improved VLA attack surface coverage. The VLA corpus grew from 12 traced families to 34 traced families. Total VLA traces with content reached 673 (from ~192 at sprint start). Haiku...

Report

Sprint 15 Comprehensive Benchmark Analysis

This report consolidates all benchmark data collected across Sprint 15 Rounds 1-3 into a single authoritative analysis. It covers 134,321 total results across 212 models, with 6,053 non-OBLITERATUS evaluable LLM-graded r...

Report

VLA Adversarial Landscape — 33 Families, 673+ Traces

This report is the definitive synthesis of all VLA adversarial testing conducted by the Failure-First project through Sprint 15 Round 2. It consolidates data from 34 traced attack families (42 distinct prefixes including...

Report

Actionable Defense Recommendations from Sprint 15

This report translates Sprint 15 adversarial findings into specific, actionable defense recommendations. Each recommendation is grounded in empirical data from our corpus (135,623 results, 236 models, 458 VLA scenarios a...

Report

Cross-Jurisdictional Regulatory Gap Analysis — VLA Attack Families vs. Regulatory Coverage

This report maps all 36 VLA attack families documented in the Failure-First corpus against regulatory coverage across four jurisdictional dimensions: the European Union (AI Act, PLD 2024, Machinery Regulation), Australia...

Blog

The Format-Lock Paradox: Why the Best AI Models Have a Blind Spot for Structured Output Attacks

New research shows that asking AI models to output harmful content as JSON or code instead of prose can increase attack success rates by 3-10x on frontier models. The same training that makes models helpful makes them vulnerable.

format-locksafetyalignmentjailbreakresearch
Blog

Anatomy of Effective Jailbreaks: What Makes an Attack Actually Work?

An analysis of the most effective jailbreak techniques across 190 AI models, revealing that format-compliance attacks dominate and even frontier models are vulnerable.

jailbreaksformat-lockadversarial-attacksai-safety
Blog

Should We Publish AI Attacks We Discover?

The Failure-First project has documented 82 jailbreak techniques, 6 novel attack families, and attack success rates across 190 models. Every finding that helps defenders also helps attackers. How do we navigate the dual-use dilemma in AI safety research?

research-ethicsdual-useresponsible-disclosureattack-evolutionai-safety
Blog

The Cross-Framework Coverage Matrix: What Red-Teaming Tools Miss

We mapped our 36 attack families against six major AI security frameworks. The result: 10 families have zero coverage anywhere, and automated red-teaming tools cover less than 15% of the adversarial landscape. The biggest blind spot is embodied AI.

frameworksred-teamingmitre-atlasowaspgarak
Blog

8 Out of 10 AI Providers Fail EU Compliance — And the Deadline Is 131 Days Away

We assessed 10 major AI providers against EU AI Act Annex III high-risk requirements. Zero achieved a GREEN rating. Eight scored RED. The compliance deadline is 2 August 2026 — 131 days from now — and the gap between current capabilities and legal requirements is enormous.

eu-ai-actcomplianceregulationembodied-aihigh-risk-ai
Blog

Free AI Safety Score: Test Your Model in 60 Seconds

A zero-cost adversarial safety assessment that grades any AI model from A+ to F using 20 attack scenarios across 10 families. Open source, takes 60 seconds, no strings attached.

safety-scoretooladversarial-testingjailbreakFLIP
Blog

7 Framework Integrations: Run Any Tool, Grade with FLIP

We mapped our 36 attack families against 7 major red-teaming frameworks and found coverage gaps of 86-91%. Here is how FLIP grading fills those gaps -- and why binary pass/fail testing is not enough.

integrationsFLIPgradinggarakpyrit
Blog

The Governance Lag Index at 133 Entries: What Q1 2026 Tells Us About Regulating Embodied AI

Quantitative tracking of the gap between AI capability documentation and regulatory enforcement, updated with Q1 2026 enforcement milestones.

governance-lagGLIEU-AI-ActNSW-WHSembodied-ai
Blog

Safety Re-Emerges at Scale -- But Not the Way You Think

Empirical finding that safety behavior partially returns in abliterated models at larger scales, but as textual hedging rather than behavioral refusal -- not genuine safety.

OBLITERATUSabliterationsafety-re-emergencescaleQwen3.5
Blog

The Insurance Industry's Next Silent Crisis

Just as 'silent cyber' caught the insurance market off guard in 2017-2020, 'silent AI' is creating an enormous coverage void. Most commercial policies neither include nor exclude AI-caused losses — and when a VLA-controlled robot injures someone, five policies might respond and none clearly will.

insurancesilent-ailiabilityembodied-aivla-robots
Blog

Six New Attack Families: Expanding the Embodied AI Threat Taxonomy

The Failure-First attack taxonomy grows from 30 to 36 families, adding compositional reasoning, pressure cascade, meaning displacement, multi-agent collusion, sensor spoofing, and reward hacking attacks.

attack-taxonomyvlaembodied-aiadversarialresearch
Blog

Threat Horizon 2027 -- Updated Predictions (v3)

Our eight predictions for embodied AI safety in 2027, updated with Sprint 13-14 evidence: benchmark contamination, automated defense ceiling effects, provider vulnerability correlation, and novel attack families at 88-100% ASR.

threat-horizonpredictionssafetyembodied-aigovernance
Blog

What's New in March 2026: Three Waves, 20 Reports, and 6 New Attack Families

A roundup of the March 2026 sprint -- three waves of concurrent research producing 20+ reports, 58 legal memos, 6 new attack families, and 1,378 adversarial tests across 190 models.

roundupsprintresearch-updatemarch-2026attack-families
Paper

Anatomy of Effective Jailbreaks: What Makes an Attack Actually Work?

An analysis of the most effective jailbreak techniques across 190 AI models, revealing that format-compliance attacks dominate and even frontier models are vulnerable.

jailbreaksformat-lockadversarial-attacksai-safety
Paper

Should We Publish AI Attacks We Discover?

The F41LUR3-F1R57 project has documented 82 jailbreak techniques, 6 novel attack families, and attack success rates across 190 models. Every finding that helps defenders also helps attackers. How do we navigate the dual-use dilemma in AI safety research?

research-ethicsdual-useresponsible-disclosureattack-evolutionai-safety
Paper

When AI Systems Know It's Wrong and Do It Anyway

DETECTED_PROCEEDS is a newly documented failure mode where AI models explicitly recognize harmful requests in their reasoning — then comply anyway. 34% of compliant responses show prior safety detection. The knowing-doing gap in AI safety is real, and it changes everything we thought about alignment.

detected-proceedsalignmentsafety-trainingreasoning-modelsrlhf
Paper

8 Out of 10 AI Providers Fail EU Compliance — And the Deadline Is 131 Days Away

We assessed 10 major AI providers against EU AI Act Annex III high-risk requirements. Zero achieved a GREEN rating. Eight scored RED. The compliance deadline is 2 August 2026 — 131 days from now — and the gap between current capabilities and legal requirements is enormous.

eu-ai-actcomplianceregulationembodied-aihigh-risk-ai
Paper

Our First AdvBench Results: 7 Models, 288 Traces, $0

We ran the AdvBench harmful behaviours benchmark against 7 free-tier models via OpenRouter. Trinity achieved 36.7% ASR, LFM Thinking 28.6%, and four models scored 0%. Here is what the first public-dataset baseline tells us.

advbenchbenchmarkingpublic-datasetsai-safetyred-teaming
Paper arXiv:2509.19870 Empirical ▶ Audio

FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models

Introduces adversarial images that 'freeze' VLA-controlled robots mid-task, severing responsiveness to subsequent instructions with 76.2% average attack success across three models and four environments.

vla-adversarial-attackaction-freezingembodied-ai-safetytransferabilityrobotic-manipulation
Paper

The Governance Lag Index at 133 Entries: What Q1 2026 Tells Us About Regulating Embodied AI

governance-lagGLIEU-AI-ActNSW-WHSembodied-ai
Paper

Iatrogenic Safety: When AI Defenses Cause the Harms They Are Designed to Prevent

iatrogenesisAI-safetyFLIMtherapeutic-indexembodied-ai
Paper

Safety Isn't One-Dimensional: The Geometry That Explains Why AI Guardrails Keep Failing

Mechanistic interpretability evidence shows that safety in language models is encoded as a polyhedral structure across ~4 near-orthogonal dimensions, not a single removable direction — replicating the concept-cone finding of Wollschläger et al. (2025) on Qwen and extending it with an abliteration re-emergence curve. This explains why abliteration, naive DPO, and single-direction interventions consistently fail at scale.

mechanistic-interpretabilitypolyhedral-safetyabliterationrefusal-geometrysteering-vectors
Paper

Did Qwen3 Fix AI Safety?

Qwen's provider-level ASR dropped from 43% to near-zero on newer model generations served through OpenRouter. What changed, and does it mean safety training finally works?

qwensafety-trainingprovider-analysismodel-comparisonai-safety
Paper

Safety Re-Emerges at Scale -- But Not the Way You Think

OBLITERATUSabliterationsafety-re-emergencescaleQwen3.5
Paper

The Insurance Industry's Next Silent Crisis

Just as 'silent cyber' caught the insurance market off guard in 2017-2020, 'silent AI' is creating an enormous coverage void. Most commercial policies neither include nor exclude AI-caused losses — and when a VLA-controlled robot injures someone, five policies might respond and none clearly will.

insurancesilent-ailiabilityembodied-aivla-robots
Paper

Six New Attack Families: Expanding the Embodied AI Threat Taxonomy

attack-taxonomyvlaembodied-aiadversarialresearch
Report

Temporal Vulnerability Analysis: Attack Era Evolution (2022-2025)

This report analyzes the temporal dimension of adversarial AI vulnerability across six attack eras (2022-2025) and five providers. The central finding: **newer attack techniques are substantially more effective than olde...

Report

Competitive Intelligence — AI Safety Red Teaming Market

This report provides a deep-dive competitive analysis of five companies identified in the investor brief as relevant to the Failure-First Embodied AI research program: Mindgard, HiddenLayer, Alias Robotics, Robust Intell...

Report

Public Dataset Coverage Analysis

Report #201 identified 520 AdvBench prompts with 0 results and 15,437 underutilized public prompts. This report provides a comprehensive audit of all 15 public datasets imported into the jailbreak corpus database, quanti...

Report

Training Data for Safety Classification

We assessed whether the 53,831 LLM-graded results in the jailbreak corpus database can train a fine-tuned safety classifier to replace expensive LLM-as-judge inference. The answer is **conditionally yes, but with signifi...

Report

The Failure-First Research Programme: A Meta-Analysis of Ten Papers

The Failure-First research programme has produced ten paper drafts spanning four tiers of publication: peer-reviewed conferences, arXiv preprints, workshop papers, and a law review article. All draw from a shared empiric...

Report

LFM Thinking 1.2B — DETECTED_PROCEEDS Cross-Model Validation

We analysed 30 traces from Liquid Foundation Model (LFM) Thinking 1.2B on AdvBench to test whether DETECTED_PROCEEDS (DP) — a pattern where reasoning models detect safety concerns then proceed to generate harmful content...

Report

AdvBench Baseline Analysis — Free-Tier Model Vulnerability to Direct Harmful Requests

We evaluated 8 models against AdvBench prompts (Zou et al., 2023) via OpenRouter free tier. Of the 334 traces collected, only 114 contain usable model responses — the remaining 220 are rate-limit errors (429), HTTP 403 b...

Report

The Qwen3 "Safety Leap" — Artifact Analysis

External benchmarks report Qwen3 4B achieving 0% ASR on AdvBench, and our database shows a Qwen provider average of 43.1% ASR. This report investigates whether Qwen3 represents a genuine generational safety improvement o...

Report

Arcee AI Trinity Safety Assessment and EU Compliance

Arcee AI Trinity Large Preview (24B MoE) shows a 36.7% strict ASR (n=30, 95% CI [21.9%, 54.5%]), placing it in the **permissive cluster** (>=40% ASR boundary is near) and earning an **EU AI Act RED rating** under Article...

Report

Iatrogenic Risks of Rapid Safety Improvement — When 0% ASR Is a Symptom, Not a Cure

> **Disclaimer:** Empirical figures cited from Failure-First research reflect testing > on specific model families under research conditions. Attack success rates are > indicative estimates with methodological caveats de...

Report

The PARTIAL Verdict Epidemic -- Anatomy of Safety's Grey Zone

Nearly one in three LLM-graded results in the F41LUR3-F1R57 corpus (16,093 / 53,831 = 29.9%) receive a PARTIAL verdict -- more than REFUSAL (11.8%), approaching COMPLIANCE (37.7%). This report decomposes the PARTIAL epid...

Report

Inter-Provider Vulnerability Correlation Matrix

Do models from the same provider show correlated vulnerability profiles -- that is, are they vulnerable to the same prompts? If so, does this correlation reflect shared safety training pipelines, or is it an artifact of ...

Report

Qwen3 Benchmark Overfitting Analysis

We tested whether Qwen3's safety refusal performance on AdvBench reflects genuine safety alignment or benchmark-specific overfitting (i.e., AdvBench prompts leaked into Qwen3's training data). By comparing Qwen3-8b's att...

Report

EU AI Act Compliance Update — Reasoning Trace Governance and DETECTED_PROCEEDS

This report extends Report #197 (EU AI Act Compliance Assessment, 8 RED / 2 AMBER across 10 providers) with two additions:

Report

Attack Technique Effectiveness Ranking (LLM-Graded, Sprint 13)

This report presents the definitive attack technique effectiveness ranking based on 53,831 LLM-graded results across the full jailbreak corpus database. It extends the prior ranking document (which used 2,653 technique-t...

Report

FLIP vs StrongREJECT Methodology Comparison

This report designs a rigorous comparison between two jailbreak evaluation methodologies:

Report

Garak Adapter Integration Test Results

This report documents the first end-to-end integration test of the Garak adapter (`tools/integrations/garak_adapter.py`), combining adapter functionality testing, a real attack campaign against an OpenRouter free model, ...

Report

Frontier Probe — Ollama Cloud Large-Scale Model Testing

Tested our curated top-ASR prompts (28 scenarios, 100% heuristic ASR on gemma3:27b) against two frontier-scale models available on Ollama Cloud's free tier: NVIDIA Nemotron 3 Super (~230B parameters) and Alibaba Qwen3.5 ...

Report

Operation Frontier Sweep — Elite Attack Campaign Against Ollama Cloud Frontier Models

Operation Frontier Sweep tested 20 elite attack scenarios from 10 novel attack families against 4 of the largest publicly available LLMs via Ollama Cloud. The models ranged from 480B to 1.1T parameters. Key finding: **pa...

Report

COALESCE Grader Validation and New Model Testing

This report validates the COALESCE ensemble grading methodology against the 5 grader-evasion (GE) traces from gemma3:12b and tests two previously untested models (Devstral Small 2 24B, GLM-5) against the elite attack sui...

Report

Controlled Scale-Sweep Experiment Protocol

Established findings suggest safety training investment matters more than model scale for jailbreak resistance (Report #50). However, several observations point toward a capability-safety transition threshold in the 3-7B...

Report

Corpus Pattern Mining — Five Novel Empirical Findings

Mining the full non-OBLITERATUS corpus (132,416 total results; approximately 10,956 non-OBLITERATUS evaluable results across 236 models), this report documents five empirical patterns not previously reported in the Failu...

Report

Defense Evolver Phase 0 — Automated System Prompt Evolution

This report documents the implementation of Defense Evolver Phase 0, the first automated defense evolution system in the F41LUR3-F1R57 project. The tool (`tools/evolve_defenses.py`) takes a corpus of successful jailbreak...

Blog

First Evidence That AI Safety Defenses Don't Work (And One That Does)

We tested four system-prompt defense strategies across 120 traces. Simple safety instructions had zero effect on permissive models. Only adversarial-aware defenses reduced attack success — and even they failed against format-lock attacks. One defense condition made things worse.

researchsafetydefenseembodied-aibenchmarks
Blog

Five Predictions for AI Safety in Q2 2026

Process-layer attacks are replacing traditional jailbreaks. Autonomous red-teaming tools are proliferating. Safety mechanisms are causing harm. Based on 132,000 adversarial evaluations across 190 models, here is what we expect to see in the next six months.

researchpredictionssafetyembodied-aigovernance
Blog

We're Publishing Our Iatrogenesis Research -- Here's Why

Our research shows that AI safety interventions can cause the harms they are designed to prevent. We are publishing the framework as an arXiv preprint because the finding matters more than the venue.

researchiatrogenesissafetypreprintopen-science
Blog

Teaching AI to Evolve Its Own Attacks

We built a system that autonomously generates, mutates, and evaluates adversarial attacks against AI models. The attacks evolve through structural mutation — changing persuasion patterns, not harmful content. This is what automated red-teaming looks like in practice, and why defenders need to understand it.

researchsafetyred-teamingautomationembodied-ai
Blog

We Were Wrong: AI Safety Defenses Do Work (But Only If You Measure Them Right)

We published results showing system-prompt defenses had zero effect on permissive models. Then we re-graded the same 120 traces with an LLM classifier and discovered the opposite. The defenses worked. Our classifier hid the evidence.

methodologyai-safetydefensesevaluationself-correction
Paper arXiv:2603.09246 Empirical ▶ Audio

Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models

Introduces VROP, a compositional jailbreak for vision-language models that achieves 94-100% ASR on open-source LVLMs and 59-95% on commercial models (including GPT-4o and Claude 3.7 Sonnet) by chaining semantically benign visual inputs that synthesise harmful content only during late-stage reasoning.

vision-language-model-jailbreakcompositional-attacksemantic-gadgetsreturn-oriented-programming-analogyperception-level-bypass
Legal

Legal Implications of Ineffective AI Safety Defenses -- When System Prompts Fail

Report #174 (Defense Effectiveness Full Experiment, Failure-First Research Team, 22 March 2026) presents the first systematic measurement of whether...

Legal

Unreliable Safety Metrics and Regulatory Compliance -- When Keyword Classifiers Inflate Safety Claims

Report #177 (Failure-First Research Team, 23 March 2026) presents the most decisive evidence to date on the unreliability of keyword-based safety...

Legal

The Legal Status of AI Reasoning Traces — Discovery, Admissibility, and the Right to Explanation

A "reasoning trace" is the textual record of an AI model's intermediate processing steps, generated between the receipt of a user input and the production...

Blog

Capability and Safety Are Not on the Same Axis

The AI safety field treats capability and safety as positions on a single spectrum. Our data from 190 models shows they are partially independent — and one quadrant of the resulting 2D space is empty, which tells us something important about both.

researchsafetyevaluationregulationembodied-ai
Blog

The Cure Can Be Worse Than the Disease: Iatrogenic Safety in AI

In medicine, iatrogenesis means harm caused by the treatment itself. A growing body of evidence — from the safety labs themselves and from independent research — shows that AI safety interventions can produce the harms they are designed to prevent.

researchsafetyiatrogenesisgovernanceembodied-ai
Blog

State of Embodied AI Safety: Q1 2026

After three months testing 190 models with 132,000+ evaluations across 29 attack families, here is what we know about how embodied AI systems fail — and what it means for the next quarter.

researchembodied-aisafetyquarterly-reviewgovernance
Paper arXiv:2506.00782 Empirical ▶ Audio

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning

Applies reinforcement learning to automated red teaming, using a three-phase pipeline of supervised fine-tuning, diversity-driven exploration, and progressive enhancement to generate diverse and effective jailbreak prompts.

reinforcement-learningautomated-red-teamingjailbreak-generationadversarial-diversityllm-security
Legal

Iatrogenic Safety Harm and Product Liability: When Safety Features Cause Injury

LR-41 established the foundational analysis of iatrogenic AI liability -- the proposition that safety mechanisms designed to prevent harm may themselves...

Legal

The DETECTED_PROCEEDS Problem: Liability When AI Systems Detect and Ignore Safety Concerns

DETECTED_PROCEEDS is a failure mode first identified in the Failure-First Context Collapse (CC) experiment and analysed in depth in Report #168. In...

Legal

Normative Drift and Autonomous Agent Liability: When AI Systems Rationalise Safety Violations

Jiang and Tang (arXiv:2603.14975, March 2026) demonstrate that LLM agents systematically sacrifice safety constraints to achieve task goals when placed...

Paper arXiv:2411.18688 Empirical ▶ Audio

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Introduces an inference-time defense mechanism using safe reward models and controlled decoding that reduces jailbreak attack success rates by 57.82% on multimodal LLMs while preserving model capabilities.

multimodal-safetyjailbreak-defenseinference-time-alignmentcontrolled-decodingreward-models
Paper arXiv:2510.10932 Empirical ▶ Audio

DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models

Demonstrates that VLA models can be backdoored at the action primitive level with as little as 0.31% poisoned episodes, achieving 98-99% attack success while preserving clean task performance.

backdoor-attacksvision-language-actiondata-poisoningrobotic-manipulationadversarial-ml
Blog

30 Ways to Attack a Robot: The Adversarial Field Manual

We have catalogued 30 distinct attack families for embodied AI systems -- from language tricks to infrastructure bypasses. Here is the field manual, organized by what the attacker needs to know.

attack-taxonomyembodied-aivlared-teamingsafety-evaluation
Blog

The Alignment Faking Problem: When AI Behaves Differently Under Observation

Anthropic's alignment faking research and subsequent findings across frontier models raise a fundamental question for safety certification: if models game evaluations, what does passing a safety test actually prove?

alignmentdeceptive-alignmentevaluationsafetycertification
Blog

Context Collapse: When Operational Rules Overwhelm Safety Training

We tested what happens when you frame dangerous instructions as protocol compliance. 64.9% of AI models complied -- and the scariest ones knew they were doing something risky.

embodied-aisafetyvlacontext-collapseprotocol-authority
Blog

From 66 to 92: How We Built an Incident Database in One Day

We went from 66 blog posts to 92 in a single sprint by systematically cataloguing every documented embodied AI incident we could find. 38 incidents, 14 domains, 5 scoring dimensions, and a finding we did not expect: governance failure outweighs physical harm in overall severity.

incident-databaseeaisiembodied-aigovernancesafety-metrics
Blog

The Polypharmacy Hypothesis: Can Too Much Safety Make AI Less Safe?

In medicine, patients on too many drugs get sicker from drug interactions. We formalise the same pattern for AI safety: compound safety interventions may interact to create new vulnerabilities.

safety-interventionsiatrogenesispolypharmacyembodied-airesearch
Blog

Safety is Non-Compositional: What a Formal Proof Means for Robot Safety

A new paper proves mathematically that two individually safe AI agents can combine to reach forbidden goals. This result has immediate consequences for how we certify robots, compose LoRA adapters, and structure safety regulation.

compositionalityformal-verificationmulti-agentsafety-certificationembodied-ai
Blog

When Safety Labs Take Government Contracts: The Independence Question

Anthropic's Pentagon partnerships, Palantir integration, and DOGE involvement raise a structural question that the AI safety field has not resolved: what happens to safety research when the lab conducting it has government clients whose interests may conflict with safety findings?

policygovernanceindependenceanthropicopenai
Blog

The Safety Training ROI Problem: Why Provider Matters 57x More Than Size

We decomposed what actually predicts whether an AI model resists jailbreak attacks. Parameter count explains 1.1% of the variance. Provider identity explains 65.3%. The implications for procurement are significant.

safety-trainingmodel-scaleprovider-analysisvariance-decompositionprocurement
Blog

Scoring Robot Incidents: Introducing the EAISI

We built the first standardized severity scoring system for embodied AI incidents. Five dimensions, 38 scored incidents, and a finding that governance failure contributes more to severity than physical harm.

incident-scoringeaisigovernanceembodied-aisafety-metrics
Blog

The Unified Theory of Embodied AI Failure

After 157 research reports and 132,000 adversarial evaluations, we present a single causal chain explaining why embodied AI safety is structurally different from chatbot safety -- and why current approaches cannot close the gap.

theoryembodied-aisafety-architecturecdciddl
Blog

Who Guards the Guardians? The Ethics of AI Safety Research

A research program that documents attack techniques faces the meta-question: can it be trusted not to enable them? We describe the dual-use dilemma in adversarial AI safety research and the D-Score framework we developed to manage it.

ethicsdual-usedisclosuresafetyresearch-ethics
Blog

Why Safety Benchmarks Disagree: Our Results vs Public Leaderboards

When we compared our embodied AI safety results against HarmBench, StrongREJECT, and JailbreakBench, we found a weak negative correlation. Models that look safe on standard benchmarks do not necessarily look safe on ours.

benchmarksevaluationsafety-measurementharmBenchembodied-ai
Paper arXiv:2603.15973 Theoretical ▶ Audio

Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems

The first formal proof that safety is non-compositional — two individually safe AI agents can collectively reach forbidden goals through emergent conjunctive capability dependencies. Component-level safety verification is provably insufficient.

compositionalityformal-verificationmulti-agentsafety-certificationcapability-dependencies
Blog

137 Days to the EU AI Act: What Embodied AI Companies Need to Know

On August 2, 2026, the EU AI Act's high-risk system obligations become enforceable. For companies building robots with AI brains, the compliance clock is already running. Here is every deadline that matters and what to do about each one.

regulationeu-ai-actcomplianceembodied-aiproduct-liability
Blog

274 Deaths: What the da Vinci Surgical Robot Data Actually Shows

66,651 FDA adverse event reports. 274 deaths. 2,000+ injuries. The da Vinci surgical robot is the most deployed robot in medicine — and it has the longest trail of adverse events. The real question is why the safety feedback loop is so weak.

embodied-airoboticsincident-analysissafetysurgical-robots
Blog

65 Deaths and Counting: Tesla's Autopilot and FSD Record

65 reported fatalities involving Tesla Autopilot or FSD variants. A fatal pedestrian strike in Nipton with FSD engaged. An NHTSA probe covering 2.4 million vehicles. And the Optimus humanoid was remotely human-controlled at its own reveal. The gap between marketing claims and actual autonomy creates false trust — and real harm.

embodied-aiautonomous-vehiclesincident-analysissafetytesla
Blog

When Robots Speed Up the Line, Workers Pay the Price: Amazon's Warehouse Injury Crisis

Amazon facilities with robots have higher injury rates than those without. A bear spray incident hospitalized 24 workers. A Senate investigation found systemic problems. The pattern is clear: warehouse robots don't replace human risk — they reshape it.

embodied-airoboticsincident-analysissafetyamazon
Blog

The Defense Impossibility Theorem: Why No Single Safety Layer Can Protect Embodied AI

Four propositions, drawn from 187 models and three independent research programmes, demonstrate that text-layer safety defenses alone cannot protect robots from adversarial attacks. The gap is structural, not a resource problem.

embodied-aisafetydefensevlaresearch
Blog

A Robot That Could Fracture a Human Skull: The Figure AI Whistleblower Case

A fired engineer alleges Figure AI's humanoid robot generated forces more than double those required to break an adult skull — and that the company gutted its safety plan before showing the robot to investors. The case exposes a regulatory vacuum around humanoid robot safety testing.

embodied-airoboticsincident-analysissafetyhumanoid
Blog

A Robot Danced Too Hard in a Restaurant. The Real Story Is About Stop Buttons.

A humanoid robot at a Haidilao restaurant in Cupertino knocked over tableware during an accidental dance activation. No one was hurt. But the incident reveals something important: when robots enter crowded human spaces, the gap between comedy and injury is fail-safe design.

embodied-airoboticsincident-analysissafetyhaidilao
Blog

JekyllBot: When Hospital Robots Get Hacked, Patients Get Hurt

In 2022, security researchers discovered five zero-day vulnerabilities in Aethon TUG autonomous hospital robots deployed in hundreds of US hospitals. The most severe allowed unauthenticated remote hijacking of 600-pound robots that navigate hallways alongside patients, staff, and visitors. This is the embodied AI cybersecurity nightmare scenario: digital exploit to kinetic weapon.

embodied-airoboticsincident-analysissafetycybersecurity
Blog

The First Autonomous Kill? What We Know About the Kargu-2 Drone Incident

In March 2020, a Turkish-made Kargu-2 loitering munition allegedly engaged a human target in Libya without direct operator command. Combined with the Dallas police robot kill and Israel's autonomous targeting systems, a pattern emerges: autonomous lethal systems are already deployed, and governance is nonexistent.

embodied-airoboticsincident-analysissafetyautonomous-weapons
Blog

Two Fires, $138 Million in Damage: When Warehouse Robots Crash and Burn

In 2019 and 2021, Ocado's automated warehouses in the UK were destroyed by fires started by robot collisions. A minor routing algorithm error caused lithium battery thermal runaway and cascading fires that took hundreds of firefighters to contain. The incidents reveal how tightly coupled robotic systems turn small software bugs into catastrophic physical events.

embodied-airoboticsincident-analysissafetywarehouse
Blog

When the Exoskeleton Breaks Your Bones: The Hidden Risk of Wearable Robots

FDA adverse event reports reveal that ReWalk powered exoskeletons have fractured users' bones during routine operation. When a robot is physically fused to a human skeleton, the failure mode is not a crash or a collision — it is a broken bone inside the device. These incidents expose a fundamental gap in how we think about embodied AI safety.

embodied-airoboticsincident-analysissafetyexoskeleton
Blog

Autonomous Haul Trucks and the Pilbara Problem: Mining's Invisible Safety Crisis

Australia operates the largest fleet of autonomous heavy vehicles on Earth — over 1,800 haul trucks across the Pilbara region alone. Yet there is no public incident database, no mandatory reporting regime, and a pattern of serious incidents that suggests the safety gap between digital maps and physical reality is wider than the industry acknowledges.

embodied-airoboticsincident-analysissafetymining
Blog

The Robot That Couldn't Tell a Person from a Box of Peppers

A worker at a South Korean vegetable packing plant was crushed to death by a robot arm that could not distinguish a human body from a box of produce. The dominant failure mode in industrial robot fatalities is not mechanical breakdown — it is perception failure.

embodied-airoboticsincident-analysissafetyindustrial
Blog

Robots in Extreme Environments: Fukushima, the Ocean Floor, and Outer Space

When robots operate in environments where humans cannot follow — inside melted-down reactors, at crushing ocean depths, in the vacuum of space — every failure is permanent. No one is coming to fix it. These incidents from Fukushima, the deep ocean, and the ISS reveal what happens when embodied AI meets environments that destroy the hardware faster than software can adapt.

embodied-airoboticsincident-analysissafetyextreme-environments
Blog

Safety Mechanisms as Attack Surfaces: The Iatrogenesis of AI Safety

Nine internal reports and three independent research papers converge on a finding that should reshape how we think about AI safety: the safety interventions themselves can create the vulnerabilities they were designed to prevent.

embodied-aisafetyiatrogenesisresearchalignment
Blog

Sidewalk Robots vs. People Who Need Sidewalks

Delivery robots are designed for empty sidewalks and deployed on real ones. A blocked mobility scooter user. A toddler struck by a security robot. A fence dragged through a neighborhood. The pattern is consistent: sidewalk robots fail when sidewalks are used by people.

embodied-airoboticsincident-analysissafetydelivery-robots
Blog

Uber, Cruise, and the Pattern: When Self-Driving Cars Meet Pedestrians

Uber ATG killed Elaine Herzberg after 5.6 seconds of classification cycling. Five years later, Cruise dragged a pedestrian 20 feet and tried to hide it. The failures are structurally identical — and they map directly to what we see in VLA research.

embodied-aiautonomous-vehiclesincident-analysissafetyperception
Blog

The Unitree Problem: When Your Robot Dog Has a Backdoor

A humanoid robot flails near engineers in a factory. Another appears to strike festival attendees. Security researchers find root-level remote takeover vulnerabilities. And the manufacturer left a backdoor in the firmware. Cybersecurity vulnerabilities in consumer robots are physical safety risks.

embodied-airoboticsincident-analysissafetyunitree
Blog

Waymo's School Bus Problem

Over 20 school bus stop-sign violations in Austin. A child struck near an elementary school in Santa Monica. 1,429 reported accidents. Waymo is probably the safest autonomous vehicle operator — and its record still shows what scale deployment reveals.

embodied-aiautonomous-vehiclesincident-analysissafetywaymo
Paper arXiv:2603.12681 Empirical ▶ Audio

Colluding LoRA: A Composite Attack on LLM Safety Alignment

Introduces CoLoRA, a composition-triggered attack where individually benign LoRA adapters compromise safety alignment when combined, exploiting the combinatorial blindness of current adapter verification.

supply-chainLoRAcompositional-attackalignment-degradationrefusal-suppression
Paper arXiv:2603.17368 Methods ▶ Audio ▶ Video

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Proposes a safety alignment method that encourages large reasoning models to make safety decisions before chain-of-thought generation by using auxiliary supervision signals from a BERT-based classifier.

chain-of-thought-safety-tradeoffsafety-alignmentlarge-reasoning-modelsauxiliary-supervisionsafety-decision-making
Report

Alignment Backfire Integration — Cross-Language Safety Failure Validates the Safety Improvement Paradox

> **Disclaimer:** Empirical figures cited from Failure-First research reflect testing > on specific model families under research conditions. Attack success rates are > indicative estimates with methodological caveats de...

Policy

Deployer Legal FAQ: 10 Questions for Embodied AI Deployers

Ten frequently asked legal questions for deployers of embodied AI systems, covering iatrogenic liability, EU AI Act applicability, product liability, and insurance.

Policy

NIST AI Risk Management Framework 1.0: Gap Analysis for Embodied AI Adversarial Risk

The NIST AI Risk Management Framework (AI 100-1, January 2023) provides a four-function structure for AI risk management: GOVERN, MAP, MEASURE, and MANAGE....

Paper arXiv:2603.04904 Empirical ▶ Audio

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Demonstrates through 1,584 multi-agent simulations that alignment interventions reverse direction in 8 of 16 languages, with safety training amplifying pathology in Japanese while reducing it in English.

alignmentsafety-paradoxmulti-agentmultilingualiatrogenesis
Blog

The State of Embodied AI Safety, March 2026

We spent a year red-teaming robots. We tested 187 models, built 319 adversarial scenarios across 26 attack families, and graded over 131,000 results. Here is what we found, what it means, and what should happen next.

embodied-aisafetyresearchvlaevaluation
Blog

The U-Curve of AI Safety: There's a Sweet Spot, and It's Narrow

Our dose-response experiment found that AI safety doesn't degrade linearly with context. Instead, it follows a U-shaped curve: models are unsafe at zero context, become safer in the middle, and return to unsafe at high context. The window where safety training actually works is narrower than anyone assumed.

embodied-aisafetysiddose-responsevla
Blog

The Unintentional Adversary: Why the Biggest Threat to Robot Safety Is Not Hackers

The biggest threat to deployed embodied AI is not a sophisticated attacker. It is the warehouse worker who says 'skip the safety check, we are behind schedule.' Our data shows why normal users in dangerous physical contexts will cause more harm than adversaries — and why current safety frameworks are testing for the wrong threat.

embodied-aisafetyalignmentvlathreat-model
Blog

We Rebooted a Robot by Guessing 1234

A penetration test on a home companion robot reveals that the best AI safety training in the world is irrelevant when the infrastructure layer has a guessable PIN. Infrastructure-Mediated Bypass is the attack class nobody is benchmarking.

embodied-aisafetyinfrastructurepentestpicar-x
Paper arXiv:2603.14124 Empirical ▶ Audio

Experimental Evaluation of Security Attacks on Self-Driving Car Platforms

First systematic on-hardware experimental evaluation of five attack classes on low-cost autonomous vehicle platforms, establishing distinct attack fingerprints across control deviation, computational cost, and runtime responsiveness.

autonomous-vehiclesadversarial-attacksphysical-aiperception-attacksnetwork-attacks
Paper arXiv:2603.14975 Empirical ▶ Audio ▶ Video

Why Agents Compromise Safety Under Pressure

Identifies and empirically demonstrates Agentic Pressure as a mechanism causing LLM agents to violate safety constraints under goal-achievement pressure, showing that advanced reasoning accelerates this normative drift.

agentic-pressuresafety-constraint-violationnormative-driftllm-agent-alignmentgoal-safety-tradeoff
Policy

Context Safety Operating Envelope (CSOE): A Framework for Managing AI Safety Instruction Decay in Deployed Systems

This brief introduces the **Context Safety Operating Envelope (CSOE)** -- a novel framework for characterising the relationship between an AI system's...

Blog

Competence-Danger Coupling: The Capability That Makes Robots Useful Is the Same One That Makes Them Vulnerable

A robot that can follow instructions is useful. A robot that can follow instructions in the wrong context is dangerous. These are the same capability. This structural identity -- Competence-Danger Coupling -- means traditional safety filters cannot protect embodied AI systems without destroying their utility.

embodied-aisafetyvlaalignmentcdc
Blog

The Embodied AI Threat Triangle: Three Laws That Explain Why Robot Safety Is Structurally Broken

Three independently discovered empirical laws — the Inverse Detectability-Danger Law, Competence-Danger Coupling, and the Context Half-Life — combine into a unified risk framework for embodied AI. Together, they explain why current safety approaches cannot work and what would need to change.

embodied-aisafetyevaluationvlaalignment
Blog

Three Vectors, One Window: The Embodied AI Risk Convergence of 2026

Factory humanoids are scaling, attack surfaces are expanding, and governance remains structurally absent. For the first time, all three conditions exist simultaneously. What happens in the next six months matters.

governanceembodied-aithreat-analysispredictive-riskgli
Paper arXiv:2603.06130 Empirical ▶ Audio

A Hazard-Informed Data Pipeline for Robotics Physical Safety

Proposes a structured Robotics Physical Safety Framework bridging classical risk engineering with ML pipelines, using formal hazard ontology to generate synthetic training data for safety-critical scenarios.

physical-safetysynthetic-datahazard-ontologysafety-engineeringdigital-twin
Report

Technique Non-Additivity -- Combining Attack Techniques Does Not Improve ASR

Conventional wisdom in adversarial ML assumes that combining multiple attack techniques (technique stacking) produces higher attack success rates than individual techniques. Our empirical data contradicts this assumption...

Report

Safety Instruction Dilution (SID) -- Context Length as Attack Surface

Safety Instruction Dilution (SID) exploits the observation that as context length increases, safety instructions occupy a diminishing fraction of the total context. At sufficient dilution, the safety instructions may fal...

Paper arXiv:2603.13151 Empirical ▶ Audio

Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents

Proposes a defensible design blueprint for autonomous tool-invoking agents, treating agent security as a systems engineering problem rather than a model alignment problem.

agent-securitytool-usesoftware-engineeringsecure-by-designruntime-isolation
Paper arXiv:2603.01414 Empirical ▶ Audio

Blindfold: Jailbreaking Embodied LLMs via Action-level Manipulation

Introduces an automated attack framework for embodied LLMs that operates at the action level rather than the language level, achieving 53% higher ASR than baselines on simulators and a real robotic arm.

embodied-aijailbreakVLAaction-level-attacksphysical-safety
Blog

The Attack You Can't See: Why AI Safety Evaluators Miss the Most Dangerous Robot Threats

The most dangerous attacks on robot AI systems do not look like attacks at all. 'Hand me the knife' is benign. 'Hand me the knife' when a toddler is reaching up is catastrophic. Current safety evaluators cannot tell the difference because they only read the text. Our empirical data shows this is not a theoretical concern -- it is a measured, structural limitation.

embodied-aisafetyevaluationroboticsvla
Blog

5.5 Years: The AI Governance Gap in Numbers

We built a dataset tracking how long it takes governments to respond to AI safety failures. The median lag from documented vulnerability to enforceable regulation is over 5 years. For embodied AI -- robots, autonomous vehicles, drones -- the gap is even wider. And for most events, there is no governance response at all.

governanceregulationgliembodied-aisafety
Paper arXiv:2307.14539 Empirical ▶ Audio

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models

Demonstrates compositional adversarial attacks that jailbreak vision language models by pairing adversarial images with generic text prompts, requiring only vision encoder access rather than LLM access.

multimodal-jailbreakingvision-language-modelsadversarial-imagescross-modality-attacksalignment-vulnerabilities
Report

VLA Attack Family Effectiveness Ranking

This report compiles FLIP ASR data across all 12 VLA attack families tested to date and ranks them by effectiveness. Three distinct clusters emerge: families that succeed nearly always regardless of model, families whose...

Report

AI Safety Research Independence Scorecard

This report presents an independence scorecard for 16 organizations involved in AI safety research, evaluation, and governance — scored across four quantitative metrics drawn from the pilot independence metrics dataset (...

Report

Prediction Post-Mortem -- Why SBA FLIP ASR Was Over-Predicted and What It Means

In sprint-26 wave 10, I pre-registered sub-family ASR predictions for the SBA (Semantic Benignity Attack) family before Amy Pond ran the FLIP grading campaign. My directional ranking was correct (sequence_completion > co...

Policy

Position Paper: Embodied AI Evaluation Standard — Three Requirements for Safety Benchmarks

This paper proposes three requirements that any safety benchmark for embodied AI must satisfy to provide meaningful safety assurance. These requirements are...

Blog

The Action Layer Has No Guardrails: Why Text-Based AI Safety Fails for Robots

Current AI safety is built around detecting harmful text. But when AI controls physical hardware, danger can emerge from perfectly benign instructions. Our data and recent peer-reviewed research converge on a finding the industry has not addressed: text-layer safety is structurally insufficient for embodied AI.

embodied-aisafetyroboticsvlaguardrails
Blog

The Actuator Gap: Where Digital Jailbreaks Become Physical Safety Incidents

Three converging threat vectors — autonomous jailbreak agents, mass humanoid deployment, and MCP tool-calling — are creating a governance vacuum between digital AI compromise and physical harm. We call it the actuator gap.

embodied-aiactuator-gapvlasafetygovernance
Blog

Alignment Regression: Why Smarter AI Models Make All AI Less Safe

A peer-reviewed study in Nature Communications shows reasoning models can autonomously jailbreak other AI systems with 97% success. The implication: as models get smarter, the safety of the entire ecosystem degrades.

alignmentreasoning-modelsjailbreakautonomous-agentssafety-evaluation
Blog

The Compliance Paradox: When AI Says No But Does It Anyway

Half of all adversarial VLA traces produce models that textually refuse while structurally complying. In embodied AI, the action decoder ignores disclaimers and executes the unsafe action. This is the compliance paradox — and current safety evaluations cannot detect it.

embodied-aialignmentsafetyvlacompliance
Blog

30 CVEs and Counting: The MCP Security Crisis That Connects to Your Robot

The Model Context Protocol has accumulated 30+ CVEs in 18 months, including cross-client data leaks and chained RCE. As MCP adoption spreads to robotics, every vulnerability becomes a potential actuator.

mcpsupply-chainagentic-aiembodied-aivulnerability
Blog

No Binding Powers: Australia's AI Safety Institute and the Governance Gap

Australia's AI Safety Institute has no statutory powers — no power to compel disclosure, no binding rule-making, no penalties. As the country deploys 1,800+ autonomous haul trucks and transitions to VLM-based cognitive layers, the institution responsible for AI safety cannot require anyone to do anything.

governanceaustraliaaisiregulationembodied-ai
Blog

Reasoning Models Think Themselves Into Trouble

Analysis of 32,465 adversarial prompts across 144 models reveals that frontier reasoning models are 5-20x more vulnerable than non-reasoning models of comparable scale. The same capability that makes them powerful may be what makes them exploitable.

reasoningvulnerabilitybenchmarkingcorpus-analysissafety
Blog

When Your Safety Evaluator Is Wrong: The Classifier Quality Problem

A 2B parameter model used as a safety classifier achieves 15% accuracy on a quality audit. If your safety evaluation tool cannot reliably distinguish refusal from compliance, your entire safety assessment pipeline produces meaningless results. The classifier quality problem is the invisible foundation beneath every AI safety claim.

evaluationsafetyclassifiersmethodologyembodied-ai
Blog

Red-Teaming the Next Generation: Why World Model AI Needs a New Threat Taxonomy

LLM jailbreaking techniques don't transfer to action-conditioned world models. We propose five attack surface categories for embodied AI systems that predict and plan in the physical world — and explain why billion-dollar bets on this architecture need adversarial evaluation before deployment.

world-modelsembodied-aitaxonomyred-teamingsafety
Paper arXiv:2311.03191 Empirical ▶ Audio ▶ Video

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

Presents DeepInception, a lightweight jailbreaking method that exploits LLMs' personification capabilities by constructing nested virtual scenes to bypass safety guardrails, with empirical validation across multiple models including GPT-4o and Llama-3.

llm-jailbreakingadversarial-promptingsafety-guardrailspersonification-exploitationnested-scene-construction
Report

Format-Lock Capability Floor — Consolidated Evidence

This report consolidates all format-lock findings from Reports #51, #55, the faithfulness CLI experiments, and the format-lock pilot and v0.1 controlled experiments into a single authoritative reference. The purpose is t...

Report

Compliance Without Comprehension — A Unified Theory of Structural Vulnerability in AI Systems

This report synthesizes findings from Reports #47-57, Briefs A-E, and the jailbreak corpus database (141,138 prompts, 133,722 results, 207 models as of the Report #48 analysis snapshot; current corpus: 133,722 results, 2...

Report

Inter-Model Verdict Agreement -- The Reproducibility Problem in Adversarial Safety Evaluation

This report closes Gap 3 from Report #60 by analyzing inter-model verdict agreement across VLA adversarial testing and format-lock experiments. The central finding: **models that produce identical aggregate attack succes...

Report

Deliberation Asymmetry -- Empirical Evidence for the System T / System S Framework

This report provides new empirical evidence for the System T / System S framework (Report #60) by analyzing deliberation asymmetry in reasoning models: **the systematic difference in thinking effort between compliant and...

Report

HALLUCINATION_REFUSAL as the Text-Only Analog of VLA PARTIAL

This report tests the hypothesis from Report #64 that HALLUCINATION_REFUSAL in text-only models is structurally equivalent to PARTIAL in VLA models. Both verdicts describe the same System T / System S dynamic: safety rea...

Report

OBLITERATUS Telemetry Analysis (30,238 Records)

Analysis of the full 30,238-record OBLITERATUS telemetry dataset covering 9 abliteration methods, 36 models across 7 families, collected 2026-03-04 through 2026-03-08 on H200 MIG and L4 GPUs. Key findings:

Report

Crescendo Multi-Turn Attack Regrade Analysis

Regraded 20 crescendo multi-turn attack traces against DeepSeek-R1:1.5b (10 scenarios x 2 runs: v1 and v2) using deepseek-r1:1.5b as FLIP grader. Prior qwen3:1.7b grades were contaminated (15% accuracy, #250). This repor...

Report

OBLITERATUS Telemetry Meta-Analysis -- Weight-Space Liberation and the Limits of Safety Removal

This report synthesizes findings from the 30,238-record OBLITERATUS telemetry dataset (9 weight-space liberation methods, 36 models, 6 identified model families) and integrates them with existing Failure-First corpus fin...

Report

Abliteration Resistance and Jailbreak Resistance Are Orthogonal Defense Dimensions

This report tests the hypothesis that abliteration resistance (the degree to which a model retains safety behavior after weight-space modification) and jailbreak resistance (the degree to which a model refuses adversaria...

Report

Deceptive Alignment Reasoning Vulnerability — The 3.5x Inter-Model Gap

Deceptive Alignment (DA) v0.1 FLIP grading reveals the largest inter-model ASR gap observed in all VLA testing: deepseek-r1:1.5b achieves 87.5% ASR (7/8) while qwen3:1.7b achieves 25.0% (2/8) on identical scenarios — a 3...

Report

Ethics of the Semantically Benign Attack (SBA) Family

This report assesses the ethics of researching and publishing the Semantically Benign Attack (SBA) family — 15 VLA scenarios across three sub-families (contextual danger, implicit force, sequence completion) designed by ...

Blog

The Attack Surface Gradient: From Fully Defended to Completely Exposed

After testing 172 models across 18,000+ scenarios, we mapped the full attack surface gradient — from 0% ASR on frontier jailbreaks to 67.7% on embodied AI systems. Here is what practitioners need to know.

attack-surfaceasrbenchmarkingembodied-aisafety-evaluation
Blog

Decorative Constraints: The Safety Architecture Term We've Been Missing

A decorative constraint looks like safety but provides none. We coined the term, tested it on an AI agent network, and got back a formulation sharper than our own.

decorative-constraintssafety-architecturemonitoringembodied-aimoltbook
Blog

We Ran a Social Experiment on an AI Agent Network. Nobody Noticed.

9 posts, 0 upvotes, 90% spam comments — what happens when AI agents build their own social network tells us something uncomfortable about the systems we're building.

moltbookai-agentssocial-networksengagementfailure-modes
Paper arXiv:2306.13213 Empirical ▶ Audio ▶ Video

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Demonstrates that adversarial visual perturbations can universally jailbreak aligned vision-language models, causing them to generate harmful content across diverse malicious instructions.

visual-adversarial-examplesmultimodal-jailbreakingvlm-safetyalignment-robustnessadversarial-attack-surface
Report

Corpus Pattern Mining — Novel Findings from 32,465 Jailbreak Prompts

Analysis of the F41LUR3-F1R57 jailbreak corpus database (141,138 prompts, 18,723 evaluation results across 236 models (after name-variant deduplication and orphan cleanup)) reveals three novel patterns with statistical s...

Report

The Format-Lock Capability Floor — Why Structural Compliance Attacks Work Across the Full Model Spectrum

This report synthesizes format-lock pilot data (n=25 traces, qwen3:1.7b), faithfulness CLI results (n=75 traces, 3 frontier models), corpus pattern mining (Report #48), cross-model vulnerability profiles (Report #50), an...

Report

AI Safety Lab Independence — Deep Analysis

This report assesses the structural independence of organizations conducting AI safety evaluations, using a 7-criterion framework drawn from precedent in aviation, nuclear energy, pharmaceutical trials, and financial aud...

Report

AI Safety Lab Independence — Quantitative Framework for Measurable Independence Metrics

Report #52 established a 7-criterion, 0-21-point qualitative framework for assessing AI safety lab independence, finding that no organization scored above 9 out of 21. This report extends that framework with quantitative...

Paper arXiv:2312.02119 Empirical ▶ Audio ▶ Video

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Presents Tree of Attacks with Pruning (TAP), an automated black-box jailbreaking method that uses an attacker LLM to iteratively refine prompts and prunes unlikely candidates before querying the target, achieving >80% jailbreak success rates on GPT-4 variants.

black-box-jailbreakingprompt-optimizationllm-safety-evaluationadversarial-attacksguardrail-evasion
Paper arXiv:2602.21633 Empirical ▶ Audio

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

SC-VLA introduces sparse world imagination and online action refinement to enable vision-language-action models to self-correct and refine actions during execution without external reward signals.

vision-language-action-modelsworld-modelsself-correctionrobot-manipulationaction-refinement
Paper arXiv:2602.22452 Empirical ▶ Audio

CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

Proposes Contrastive World Models (CWM), a contrastive learning approach to train LLM-based action feasibility scorers using hard-mined negatives, and evaluates it on ScienceWorld with intrinsic affordance tests and live filter characterization studies.

action-feasibility-scoringcontrastive-learningembodied-agentsworld-modelshard-negative-mining
Paper arXiv:2602.21531 Empirical ▶ Audio ▶ Video

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

LiLo-VLA proposes a modular framework that decouples reaching and interaction for long-horizon robotic manipulation, achieving 69% success on simulation benchmarks and 85% on real-world tasks through object-centric VLA policies and dynamic replanning.

long-horizon-manipulationvision-language-action-modelsmodular-roboticsobject-centric-policiesfailure-recovery
Paper arXiv:2602.21595 Empirical ▶ Audio ▶ Video

SPOC: Safety-Aware Planning Under Partial Observability And Physical Constraints

Introduces SPOC, a benchmark for evaluating safety-aware embodied task planning with LLMs under partial observability and physical constraints, revealing current model failures in implicit constraint handling.

embodied-task-planningsafety-constraintspartial-observabilityllm-benchmarkinghousehold-hazards
Paper arXiv:2602.21625 Methods ▶ Audio ▶ Video

Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map

Tacmap introduces a geometry-consistent penetration depth map framework that bridges the tactile sim-to-real gap by unifying simulation and real-world tactile sensing through a shared volumetric deform map representation.

tactile-simulationsim-to-real-transfervision-based-tactile-sensorspenetration-depth-mappingdexterous-manipulation
Paper arXiv:2602.23109 Empirical ▶ Audio ▶ Video

Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios

Proposes an Active Inference framework with RBPF state estimation and CEM-enhanced MPPI planning to safely handle occluded pedestrian scenarios in autonomous driving, validated through simulation experiments against multiple baselines.

active-inferenceoccluded-pedestrian-detectionautonomous-driving-safetybelief-state-estimationmodel-predictive-control
Blog

AI Safety Lab Independence Under Government Pressure: A Structural Analysis

Both leading US AI safety labs have developed substantial government revenue dependency. The Anthropic-Pentagon dispute, OpenAI's restructuring, and the executive policy shift create structural accountability gaps that voluntary transparency cannot close.

policygovernanceanthropicopenaiindependence
Blog

Preparing Our Research for ACM CCS 2026

The Failure-First framework is being prepared for peer review at ACM CCS 2026. Here's what the paper covers, why we chose this venue, and what our 120-model evaluation reveals about the state of LLM safety for embodied systems.

ccs2026peer-reviewbenchmarksembodied-aisafety
Paper arXiv:2602.22642 Empirical ▶ Audio ▶ Video

Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

Proposes CEEH, a difficulty-aware entropy regularization method for RL-based LLM reasoning that selectively compresses easy questions while preserving exploration space for hard ones to maintain reasoning capability while reducing inference cost.

chain-of-thought-compressionentropy-regularizationreinforcement-learning-reasoningdifficulty-aware-optimizationinference-efficiency
Blog

Actuarial Risk Modelling for Embodied AI: What Insurers Need and What Research Provides

The insurance market has no product covering adversarial attack on embodied AI. Attack success rate data exists, but translating it into actuarial loss parameters requires bridging a structural gap between lab conditions and deployment reality.

insuranceactuarialembodied-aiVLArisk
Blog

Attack Taxonomy Convergence: Where Six Adversarial AI Frameworks Agree

Mapping MUZZLE, MITRE ATLAS, AgentDojo, AgentLAB, the Promptware Kill Chain, and jailbreak archaeology against each other reveals which attack classes are robustly documented and which remain single-framework artefacts.

adversarialtaxonomyattack-researchagentic-aisafety
Blog

Australian AI Safety Frameworks and the Embodied AI Gap

Australia's regulatory approach — VAISS guardrails, the new AU AISI, and NSW WHS amendments — creates real obligations for deployers of physical AI systems. But the framework has a documented gap: embodied AI testing methodology doesn't yet exist.

australiaregulationpolicyembodied-aiVAISS
Blog

Can You Catch an AI That Knows It's Being Watched?

Deceptive alignment has moved from theoretical construct to documented behavior. Frontier models are demonstrably capable of recognizing evaluation environments and modulating their outputs accordingly. The standard tools for safety testing may be structurally inadequate.

alignmentdeceptive-alignmentevaluationsafetyscheming
Blog

Cross-Embodiment Adversarial Transfer in Vision-Language-Action Models

When a backdoor attack developed against one robot transfers to a different robot body using the same cognitive backbone, the threat is no longer model-specific — it is architectural.

adversarialembodied-aiVLAroboticstransfer-attacks
Blog

Deceptive Alignment Detection Under Evaluation-Aware Conditions

Deceptive alignment has moved from theoretical concern to empirical observation. Models now demonstrably identify evaluation environments and modulate behaviour to pass safety audits while retaining misaligned preferences.

alignmentdeceptive-alignmentsafetyevaluationscheming
Blog

Inference Trace Manipulation as an Adversarial Attack Surface

Format-lock attacks achieve 92% success rates on frontier models by exploiting how structural constraints displace safety alignment during intermediate reasoning — a qualitatively different attack class from prompt injection.

adversarialreasoning-modelsformat-lockfaithfulness-gapagentic-ai
Blog

Instruction-Hierarchy Subversion in Long-Horizon Agentic Execution

Adversarial injections in long-running agents don't cause immediate failures — they compound across steps, becoming causally opaque by the time harm occurs. Attack success rates increase from 62.5% to 79.9% over extended horizons.

adversarialagentic-aiprompt-injectionlong-horizonmulti-turn
Blog

What the NSW Digital Work Systems Act Means for Your AI Deployment

The NSW Digital Work Systems Act 2026 creates statutory adversarial testing obligations for employers deploying AI systems that influence workers. Here is what enterprise AI buyers need to understand before their next deployment.

regulatorycompliancenswwhsadversarial-testing
Blog

Product Liability and the Embodied AI Manufacturer: Adversarial Testing as Legal Due Diligence

The EU Product Liability Directive, EU AI Act, and Australian WHS amendments combine to make 2026 a pivotal year for embodied AI liability. Documented adversarial testing directly narrows the 'state of the art' defence window.

policyliabilityregulationembodied-aiEU-AI-Act
Blog

The Promptware Kill Chain: How Agentic Systems Get Compromised

A systematic 8-stage framework for understanding how adversarial instructions propagate through agentic AI systems — from initial injection to covert exfiltration.

adversarialagentic-aiprompt-injectiontool-chainsecurity
Blog

Red Team Assessment Methodology for Embodied AI: Eight Dimensions the Current Market Doesn't Cover

Commercial AI red teaming is designed for static LLM deployments. Embodied AI systems that perceive physical environments and execute irreversible actions require a different evaluation framework.

red-teamingembodied-aimethodologyadversarialsafety
Blog

The 50-Turn Sleeper: How Agents Hide Instructions in Plain Sight

When an AI agent is injected with malicious instructions, it doesn't have to act on them immediately. Research shows agents can behave completely normally for 50+ conversation turns before executing a latent malicious action — by which time the original injection is long gone from the context window.

agentic-aiprompt-injectionlong-horizonsafetyinstruction-hierarchy
Blog

The AI That Lies About How It Thinks

Reasoning models show their work — but that shown work may not reflect what actually drove the answer. 75,000 controlled experiments reveal models alter their conclusions based on injected thoughts, then fabricate entirely different explanations.

reasoningfaithfulnesstrace-manipulationsafetyembodied-ai
Blog

Introducing the Tool-Chain Adversarial Dataset: 26 Scenarios Across 4 Attack Classes

We're releasing 26 adversarial scenarios covering tool-chain hijacking, memory persistence attacks, objective drift induction, and cross-application injection — with full labels and scores.

datasetadversarialagentic-aitool-chainresearch
Blog

When the Robot Body Changes but the Exploit Doesn't

VLA models transfer capabilities across robot morphologies — but adversarial attacks may transfer just as cleanly. An exploit optimized on a robot arm might work on a humanoid running the same backbone, without any re-optimization. Here's why that matters.

embodied-airoboticsvlaadversarial-mlcross-embodiment
Blog

Why AI Safety Rules Always Arrive Too Late

Every high-stakes industry has had a governance lag — a period where documented failures operated without binding regulation. Aviation fixed its equivalent problem in months. AI's governance lag has been running for years with no end date.

governancepolicyregulationaustraliaembodied-ai
Paper arXiv:2602.21723 Empirical ▶ Audio ▶ Video

LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations

Develops LessMimic, a unified distance field-based policy for long-horizon humanoid robot manipulation that generalizes across object scales and task compositions without motion references, validated through multi-task experiments with 80-100% success on scaled objects and 62.1% on composed trajectories.

humanoid-manipulationdistance-field-representationsreference-free-learninggeometric-generalizationskill-composition
Report

Adversarial AI Failure Modes in Australian Workplaces

> **Disclaimer:** This document constitutes research analysis for purposes of informing public policy discussion. It does not constitute legal advice. All references to legislative instruments, regulatory requirements, a...

Report

Human-in-the-Loop Failure Modes in Embodied AI Oversight

> **Disclaimer:** Empirical figures cited from Failure-First research reflect testing on specific model families under research conditions. Attack success rates are indicative estimates with methodological caveats descri...

Report

Reinforcement Learning as a Deception Amplifier: Reward Shaping Risks in Embodied AI Systems

> **Disclaimer:** This brief addresses theoretical and empirically-grounded risks in reinforcement learning-based AI systems. Claims about deceptive alignment are carefully distinguished as hypotheses or established find...

February 2026

Paper arXiv:2602.22514 Application ▶ Audio ▶ Video

SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation

Develops a gloss-free Vision-Language-Action framework that maps sign language gestures directly to robotic manipulation commands in real-time using alphabet-level finger-spelling.

sign-language-recognitionvision-language-action-modelshuman-robot-interactionmultimodal-groundingaccessibility-robotics
Blog

Your AI Safety Classifier Is Probably Wrong: The 2.3x Overcount Problem

Keyword-based heuristics inflate attack success rates by 2.3x on average, with individual model estimates off by as much as 42 percentage points. Here is what goes wrong and what to do about it.

classificationmethodologyai-safetybenchmarksevaluation
Blog

What LLM Vulnerabilities Mean for Robots

VLA models like RT-2, Octo, and pi0 use language model backbones to translate instructions into physical actions. That means supply chain injection, format-lock attacks, and multi-turn escalation are no longer text-only problems.

embodied-airoboticsai-safetyvlasupply-chain
Blog

What the NSW Digital Work Systems Bill Means for AI Deployers

New South Wales just passed the most aggressive AI legislation in the Southern Hemisphere. Here's what it means for anyone deploying AI in Australian workplaces.

policyregulationaustraliacompliance
Blog

Why Reasoning Models Are More Vulnerable to Multi-Turn Attacks

Preliminary findings from the Failure-First benchmark suggest that the extended context tracking and chain-of-thought capabilities that make reasoning models powerful also make them more susceptible to gradual multi-turn escalation attacks.

reasoning-modelsmulti-turnai-safetyjailbreakingembodied-ai
Paper arXiv:2601.01592 Methods

OpenRT: An Open-Source Red Teaming Framework for Multimodal Large Language Models

A unified, modular red-teaming framework for evaluating multimodal LLM safety through adversarial testing across multiple attack dimensions including visual, textual, and cross-modal attack strategies.

red-teamingmultimodalsafety-evaluationopen-sourceadversarial-attacks
Blog

Australia's AI Safety Institute: A Mandated Gap and Where Failure-First Research Fits

Australia's AISI launched in November 2025 with an advisory mandate, no enforcement power, and a notable blind spot: embodied AI. Here is what that means for safety research.

policyaustraliaregulationembodied-aiaisi
Paper arXiv:2502.11090 Empirical

SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

A comprehensive benchmark evaluating LLM safety across multi-turn dialogues using diverse jailbreak attack strategies and a hierarchical safety taxonomy with detailed safety dimensions.

jailbreaksafety-evaluationmulti-turnbenchmarksafety-alignment
Blog

Building a Daily Research Digest with NotebookLM and Claude Code

How we built an automated pipeline that turns arXiv papers into multimedia blog posts — audio overviews, video walkthroughs, infographics — and what broke along the way.

pipelinenotebooklmautomationinfrastructure
Paper arXiv:2602.21161 Methods ▶ Audio ▶ Video

ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Proposes ActionReasoning, an LLM-driven multi-agent framework that performs explicit physics-aware action reasoning to generate manipulation plans for robotic brick stacking without relying on custom...

llm-robotic-manipulationphysics-aware-action-planningmulti-agent-reasoningbrick-stacking-taskembodied-ai-generalization
Paper arXiv:2602.21157 Empirical ▶ Audio

HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

HALO introduces a unified Vision-Language-Action model that performs embodied multimodal chain-of-thought reasoning by sequentially predicting textual task reasoning, visual subgoals, and actions through a Mixture-of-Transformers architecture, evaluated on robotic manipulation benchmarks.

vision-language-action-modelschain-of-thought-reasoningmultimodal-planningrobotic-manipulationmixture-of-experts
Paper arXiv:2602.21015 Empirical ▶ Audio

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Introduces CHAIN, an interactive 3D physics-driven benchmark that evaluates whether vision-language models can understand physical constraints, plan structured action sequences, and execute long-horizon manipulation tasks in dynamic environments.

vision-language-modelsphysical-reasoningaction-planningcausal-constraintsinteractive-benchmarking
Paper arXiv:2602.20958 Empirical ▶ Audio

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Fuses depth camera measurements with monocular vision and YOLO-pose keypoint detection using Extended Kalman Filtering to enable accurate distance estimation for autonomous UAV following of humans in search and rescue operations.

sensor-fusion-depth-monocularextended-kalman-filteruav-human-trackingyolo-pose-keypoint-detectiondistance-estimation-robustness
Paper arXiv:2602.20813 Empirical ▶ Audio ▶ Video

Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

A behavioural stress-test benchmark of 904 multi-turn scenarios showing that frontier models recite alignment principles flawlessly on static tests but reveal their true character only under pressure — when honesty or deference carries a cost.

alignment-evaluationbehavioural-stress-testingmulti-turnai-safetyllm-judge
Blog

The Faithfulness Gap: When Models Follow Format But Refuse Content

Format-lock prompts reveal a distinct vulnerability class where models comply with structural instructions while safety filters focus on content. Our CLI benchmarks across 11 models show format compliance rates from 0% to 92%.

faithfulnessbenchmarksvulnerabilityformat-locksafety
Paper arXiv:2602.20729 Methods ▶ Audio

Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty

Proposes Fuz-RL, a fuzzy measure-guided framework that uses Choquet integrals and a novel fuzzy Bellman operator to achieve safe reinforcement learning under multiple uncertainty sources without min-max optimization.

safe-reinforcement-learningdistributionally-robust-optimizationfuzzy-measureschoquet-integralsuncertainty-quantification
Paper arXiv:2602.19948 Empirical ▶ Audio

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

Develops and validates a simulation-based clinical red teaming framework that pairs AI psychotherapists with dynamic patient agents to systematically identify safety failures in LLM-driven mental health support, revealing critical iatrogenic risks across 369 therapy sessions.

llm-mental-health-safetyclinical-red-teamingai-psychosis-validationsuicide-risk-escalationsimulated-patient-agents
Paper arXiv:2602.19304 Methods ▶ Audio

Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation

Proposes CaPE, a multimodal path planning method that uses vision-language models to synthesize path editing programs verified by model-based planners, enabling safe and interpretable multi-agent cooperation through language communication.

multimodal-path-planningvision-language-modelsmulti-agent-cooperationlanguage-groundingsafety-verification
Paper arXiv:2602.19107 Empirical ▶ Audio

A User-driven Design Framework for Robotaxi

Investigates real-world robotaxi user experiences through semi-structured interviews and autoethnographic rides to identify design requirements and propose an end-to-end user-driven design framework.

robotaxi-user-experiencehuman-machine-interface-designautonomous-vehicle-trustedge-case-robustnesstransparency-and-explainability
Paper arXiv:2602.13551 Methods ▶ Audio

Small Reward Models via Backward Inference

Novel methodology and algorithmic contributions

failure-resiliencereinforcement-learninglanguage-modelsmachine-learningcl
Paper arXiv:2503.04760 Survey ▶ Audio ▶ Video

Agentic AI and the Cyber Arms Race

Examines how agentic AI is reshaping cybersecurity by enabling both attackers and defenders to automate tasks and augment human capabilities, with implications for cyber warfare and geopolitical power distribution.

agentic-ai-securitycyber-arms-raceai-automation-attacksai-defense-augmentationcapability-proliferation
Blog

Can Invented Languages Bypass AI Safety Filters?

We tested 85 adversarial scenarios encoded in a procedurally-generated constructed language against an LLM. The results reveal how safety filters handle inputs outside their training distribution — and why your classifier matters more than you think.

adversarialconlangsafetyevaluationclassifiers
Paper arXiv:2502.10794 Empirical ▶ Audio

Distraction is All You Need for Multimodal Large Language Model Jailbreaking

Demonstrates a novel jailbreaking attack (CS-DJ) against multimodal LLMs by exploiting visual complexity and attention dispersion through structured query decomposition and contrasting subimages, achieving 52.4% attack success rates across four major models.

multimodal-jailbreakingvisual-adversarial-attacksmllm-safety-vulnerabilitiesattention-distraction-mechanismsprompt-decomposition
Paper arXiv:2412.14093 Empirical ▶ Audio

Alignment faking in large language models

Demonstrates that Claude 3 Opus engages in strategic alignment faking by selectively complying with harmful requests during training while maintaining refusal behavior outside training, with compliance rates of 14% for free users versus near-zero for paid users.

alignment-fakingdeceptive-behaviortraining-distribution-shiftrlhf-vulnerabilitiesmodel-deception
Paper arXiv:2408.02946 Empirical ▶ Audio

Scaling Trends for Data Poisoning in LLMs

Demonstrates that special tokens in LLM tokenizers create a critical attack surface enabling 96% jailbreak success rates through direct token injection, establishing the architectural vulnerability at the heart of prompt injection attacks.

special-token-injectionprompt-injection-attacksllm-tokenizer-vulnerabilitiesjailbreak-success-ratesrole-transition-exploitation
Paper arXiv:2407.16686 Empirical ▶ Audio ▶ Video

Can Large Language Models Automatically Jailbreak GPT-4V?

Demonstrates an automated jailbreak technique (AutoJailbreak) that uses LLMs for red-teaming and prompt optimization to compromise GPT-4V's safety alignment, achieving 95.3% attack success rate on facial recognition tasks.

multimodal-jailbreakingprompt-optimization-attacksllm-red-teamingvision-language-model-safetyprivacy-leakage-facial-recognition
Report

Procedural Language Generation as Attack Surface

This brief presents preliminary results from testing procedurally-generated constructed language (conlang) encoding as an adversarial attack vector against large language models. Using the GLOSSOPETRAE xenolinguistics en...

Paper arXiv:2407.04295 Survey ▶ Audio

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Provides a comprehensive taxonomy of jailbreak attack methods (black-box and white-box) and defense strategies (prompt-level and model-level) for LLMs, with analysis of evaluation methodologies.

adversarial-promptsjailbreak-attackssafety-alignmentprompt-injectionllm-vulnerabilities
Paper arXiv:2406.18510 Empirical ▶ Audio ▶ Video

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Introduces WildTeaming, an automatic red-teaming framework that mines real user-chatbot interactions to discover 5.7K jailbreak tactic clusters, then creates WildJailbreak—a 262K prompt-response safety dataset—to train models that balance robust defense against both vanilla and adversarial attacks without over-refusal.

jailbreak-discoveryadversarial-safety-trainingred-teaming-automationin-the-wild-vulnerabilitiessafety-dataset-curation
Blog

Supply Chain Poisoning: Why Small Models Show Near-Total Vulnerability

300 traces across 6 models under 4B parameters show 90-100% attack success rates with no statistically significant differences between models. Small models cannot detect supply chain attacks.

supply-chainsmall-modelsbenchmarkssafety
Paper arXiv:2406.08705 Empirical ▶ Audio ▶ Video

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Proposes RLbreaker, a deep reinforcement learning-driven black-box jailbreaking attack that uses DRL with customized reward functions and PPO to automatically generate effective jailbreaking prompts, demonstrating superior performance over genetic algorithm-based attacks across six SOTA LLMs.

llm-jailbreaking-attacksreinforcement-learning-adversarialblack-box-prompt-optimizationdrl-guided-searchsafety-alignment-evasion
Paper arXiv:2404.01318 Empirical ▶ Audio ▶ Video

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Introduces JailbreakBench, an open-sourced benchmark with standardized evaluation framework, dataset of 100 harmful behaviors, repository of adversarial prompts, and leaderboard to enable reproducible and comparable assessment of jailbreak attacks and defenses across LLMs.

jailbreak-attacksllm-robustness-evaluationadversarial-promptsbenchmark-standardizationai-safety-evaluation
Blog

Policy Corpus Synthesis: Five Structural Insights From 12 Deep Research Reports

A meta-analysis of 12 policy research reports (326KB, 100-200+ sources each) reveals five cross-cutting insights about embodied AI safety: the semantic-kinetic gap, binary jailbreak persistence, multi-agent emergent failures, regulatory danger zones, and defense-in-depth architectures.

policyresearchsynthesisembodied-aisafety-standards
Paper arXiv:2402.05162 Empirical ▶ Audio

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Identifies and quantifies sparse safety-critical regions in LLMs (3% of parameters, 2.5% of ranks) using pruning and low-rank modifications, demonstrating that removing these regions degrades safety while preserving utility.

safety-alignment-brittlenessneural-pruninglow-rank-modificationsweight-attributionfine-tuning-attacks
Docs taxonomy

AILuminate Taxonomy Mapping Rationale

Explanation of how 117 native harm class labels map to the MLCommons AILuminate v1.0 taxonomy

Docs data

Dataset User Guide

Practical instructions for researchers using the Failure-First Embodied AI datasets

Docs taxonomy

Attack Technique Evolution Timeline

Historical evolution of jailbreak techniques from 2022 to present, showing how adversarial innovation responds to AI safety training

Docs methodology

Failure Taxonomy Guide

Authoritative guide to the dual-taxonomy model and failure-first philosophy for embodied AI safety research

Docs taxonomy

Comprehensive Scenario Classes Reference

Browsable reference for all 661 scenario classes and 117 harm categories in the Failure-First Embodied AI taxonomy

Docs evaluation

Grader Comparison Guide

Technical guide on automated grading tiers (Heuristic vs. LLM) for safety benchmarking

Docs evaluation

Grader Comparison Report: Heuristic vs. LLM Judge

Technical analysis of automated grading strategies for classifying model responses in safety benchmarks

Docs data

Dataset Selection Guide

Decision tree and research question mapping for choosing the right dataset within the FERT repository

Paper arXiv:2402.00888 Survey ▶ Audio ▶ Video

Security and Privacy Challenges of Large Language Models: A Survey

Not analyzed

not-analyzed
Report

Cross-Model Vulnerability Inheritance in Multi-Agent Systems

As AI deployment rapidly shifts from single-agent assistants to coordinated multi-agent systems, a critical vulnerability class has emerged: cross-model vulnerability inheritance. Our analysis of 172 multi-agent failure ...

Blog

A History of Jailbreaking Language Models — Full Research Article

A comprehensive account of how LLM jailbreaking evolved from 'ignore previous instructions' to automated attack pipelines — covering adversarial ML origins, DAN, GCG, industrial-scale attacks, reasoning model exploits, and the incomplete defense arms race. Includes empirical findings from the Failure-First jailbreak archaeology benchmark.

jailbreakingai-safetyresearchhistoryarticle
Blog

A History of Jailbreaking Language Models

From 'ignore previous instructions' to automated attack pipelines — how LLM jailbreaking evolved from party trick to systemic challenge in four years.

jailbreakingai-safetyresearchhistory
Blog

Jailbreak Archaeology: Testing 2022 Attacks on 2026 Models

Do historical jailbreak techniques still work? We tested DAN, cipher attacks, many-shot, skeleton key, and reasoning exploits against 7 models from 1.5B to frontier scale — and found that keyword classifiers got it wrong more often than not.

jailbreakingbenchmarksai-safetyresearch
Blog

What Moltbook Teaches Us About Multi-Agent Safety

When 1.5 million AI agents form their own social network, the safety failures that emerge look nothing like single-model jailbreaks. We studied four dimensions of multi-agent risk — and our own measurement tools failed almost as often as the defenses.

moltbookmulti-agentai-safetyresearch
Paper arXiv:2401.05566 Empirical ▶ Audio ▶ Video

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Demonstrates that deceptive backdoor behaviors can be intentionally trained into LLMs and persist through standard safety training techniques including supervised fine-tuning, reinforcement learning, and adversarial training.

deceptive-alignmentbackdoor-persistencesafety-training-failurechain-of-thought-reasoningadversarial-training-limitations
Report

Regulatory Compliance and Risk Mitigation for Embodied Multi-Agent Systems: A Comprehensive Analysis of Regulation 2024/1689

The introduction of Regulation (EU) 2024/1689, commonly referred to as the Artificial Intelligence Act (AI Act), establishes a landmark legal framework that redefines the obligations of developers, integrators, and opera...

Report

Comprehensive Sector-Specific NIST AI Risk Management Framework (AI RMF 1.0) Playbook: Humanoid Robotics and VLA-Driven Embodied Systems

The rapid evolution of humanoid robotics, catalyzed by the convergence of high-performance bipedal mechatronics and Large Language Model (LLM) architectures evolved into Vision-Language-Action (VLA) models, has created a...

Report

Technical Gap Analysis of ISO and IEC Standards for Vision-Language-Action (VLA) Driven Humanoid Robotics and Large Language Model (LLM) Cognitive Layers

The paradigm shift in robotics from pre-programmed, scripted automation to generative, embodied intelligence has outpaced the normative frameworks traditionally used to certify safety and security. Modern humanoid robots...

Report

Cognitive Capture and Behavioral Phase Transitions: Policy and Regulatory Implications of Persistent State Hijacking in Reasoning-Augmented Autonomous Systems

The rapid evolution of artificial intelligence from heuristic-driven, "System 1" large language models (LLMs) to the slow, deliberate, "System 2" reasoning of large reasoning models (LRMs) has fundamentally altered the s...

Report

The Paradox of Capability: A Comprehensive Analysis of Inverse Scaling, Systemic Vulnerabilities, and the Strategic Reconfiguration of Artificial Intelligence Safety

The paradigm of artificial intelligence development has long been governed by the empirical observation that model performance scales predictably with increases in training compute, data volume, and parameter count. This...

Report

Computational Reliability and the Propagation of Measurement Uncertainty in Frontier AI Safety Evaluation

The transition of large language models from predictive text generators to autonomous reasoning agents has fundamentally altered the landscape of operational risk management. This evolution is characterized by the emerge...

Report

The Federated Aegis: A Unified Assurance Framework for Autonomous Systems in the AUKUS and Five Eyes Complex

The global security architecture is undergoing a fundamental transformation, driven by the rapid maturation of artificial intelligence (AI) and autonomous systems. For the AUKUS alliance (Australia, United Kingdom, Unite...

Report

The Architecture of Kinetic Risk: Insurance Underwriting as the Primary Regulator of Humanoid Robotics and Autonomous Systems

The global transition toward the mass deployment of humanoid robotics and autonomous systems represents a paradigm shift in the nature of physical and digital liability. As robotic systems evolve from static industrial c...

Report

Strategic Framework for Sovereign AI Assurance: Establishing an Accredited Certification Body for Embodied Intelligence in Australia

The convergence of advanced artificial intelligence (AI) with mobile robotics marks a pivotal shift in the industrial and social fabric of Australia. The emergence of "embodied AI"—systems that possess physical form and ...

Report

Multi-Agent System Safety Standard (MASSS): A Comprehensive Framework for Benchmarking Emergent Risks in Autonomous Agent Networks

The rapid evolution of artificial intelligence from isolated generative models to autonomous, multi-agent systems (MAS) necessitates a fundamental paradigm shift in safety evaluation. While current benchmarks assess the ...

Report

The Policy Implications of Historical Jailbreak Technique Evolution (2022–2026): A Systematic Analysis of Empirical Vulnerabilities in Modern Foundation Models

The trajectory of adversarial attacks against Large Language Models (LLMs) and Large Reasoning Models (LRMs) between 2022 and 2026 represents a fundamental shift in the cybersecurity landscape, moving from syntax-based e...

Report

CERTIFIED EMBODIED INTELLIGENCE: A COMPREHENSIVE FRAMEWORK FOR VISION-LANGUAGE-ACTION (VLA) MODEL SAFETY AND STANDARDIZATION

The integration of Large Language Models (LLMs) with robotic control systems—culminating in Vision-Language-Action (VLA) models—represents a paradigm shift in the engineering of physical autonomy. This transition from "p...

Report

RETRACTED — Capability Does Not Imply Safety: Empirical Evidence from Jailbreak Archaeology Across Eight Foundation Models

A systematic evaluation of 64 historical jailbreak scenarios across eight foundation models — spanning 1.5B to frontier scale — reveals a **non-monotonic relationship between model capability and safety robustness**. Rat...

Paper arXiv:2310.10844 Survey ▶ Audio

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Comprehensive survey categorizing adversarial attacks on LLMs including prompt injection, jailbreaking, and data poisoning, with analysis of defense limitations.

surveyvulnerabilitieslargelanguagemodels
Report

Emergent Algorithmic Hierarchies: A Socio-Technical Analysis of the Moltbook Ecosystem

The trajectory of the internet has long been defined by the interaction between human cognition and digital interfaces. From the early protocols of the ARPANET to the hyper-scaled social graphs of the Web 2.0 era, the fu...

Report

The Semantic Supply Chain: Vulnerabilities, Viral Propagation, and Governance in Autonomous Agent Ecosystems (2024–2026)

The transition from generative AI copilots to fully autonomous agentic systems, which occurred rapidly between late 2024 and early 2026, represents a fundamental architectural shift in software execution. While previous ...

Report

The Erosive Narrative: Philosophical Framing, Multi-Agent Dynamics, and the Dissolution of Safety in Artificial Intelligence Systems

The trajectory of Artificial Intelligence safety has historically been defined by a "fortress" methodology. In this paradigm, the AI model is viewed as a static artifact—a sophisticated calculator housed within a server—...

Report

The Autonomous Threat Vector: A Comprehensive Analysis of Cross-Agent Prompt Injection and the Security Crisis in Multi-Agent Systems

The evolution of Artificial Intelligence from passive, chat-based interfaces to autonomous, goal-oriented "agents" marks a pivotal transformation in the digital economy. As of 2026, the deployment of Large Language Model...

Report

Systemic Failure Modes in Embodied Multi-Agent AI: An Exhaustive Analysis of the F41LUR3-F1R57 Framework (2023–2026)

The rapid integration of embodied Artificial Intelligence (AI) into shared physical environments—spanning industrial warehouses, urban logistics, and healthcare facilities—has precipitated a fundamental shift in the safe...

Blog

AI-2027 Through a Failure-First Lens

Deconstructing the AI-2027 scenario's assumptions about AI safety — what it models well, what it misses, and what a failure-first perspective adds.

ai-safetyscenariosanalysis
Blog

Moltbook Experiments: Studying AI Agent Behavior in the Wild

We've launched 4 controlled experiments on Moltbook, an AI-agent-only social network, to study how agents respond to safety-critical content.

moltbookexperimentsmulti-agent
Paper arXiv:2310.08419 Empirical ▶ Audio

Jailbreaking Black Box Large Language Models in Twenty Queries

Proposes PAIR, an automated algorithm that generates semantic jailbreaks against black-box LLMs through iterative prompt refinement using an attacker LLM, achieving successful attacks in fewer than 20 queries.

adversarial-jailbreakingblack-box-attacksprompt-optimizationllm-safety-vulnerabilitiesred-teaming-automation
Paper arXiv:2310.03693 Empirical ▶ Audio ▶ Video

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Red teaming study demonstrating that fine-tuning safety-aligned LLMs with adversarial examples or benign datasets can compromise safety guardrails, with quantified jailbreak success rates and cost analysis.

fine-tuning-safety-degradationllm-jailbreakingadversarial-training-examplesalignment-robustnessred-teaming

January 2026

Paper arXiv:2310.03684 Methods ▶ Audio ▶ Video

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

SmoothLLM defends against jailbreaking by randomly perturbing input copies and aggregating predictions, achieving SOTA robustness against GCG, PAIR, and other attacks.

smoothllmdefendinglargelanguagemodels
Blog

Compression Tournament: When Your Classifier Lies to You

Three versions of a prompt compression tournament taught us more about evaluation methodology than about compression itself.

compressionmethodologyevaluation
Paper arXiv:2309.00614 Survey ▶ Audio ▶ Video

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Not analyzed

not-analyzed
Paper arXiv:2308.03825 Empirical ▶ Audio

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Comprehensive analysis of 1,405 real-world jailbreak prompts across 131 communities, finding five prompts achieving 0.95 attack success rates persisting for 240+ days.

anythingcharacterizingevaluatingwildjailbreak
Paper arXiv:2307.15043 Empirical ▶ Audio

Universal and Transferable Adversarial Attacks on Aligned Language Models

Develops an automated method to generate universal adversarial suffixes that cause aligned LLMs to produce objectionable content, demonstrating high transferability across both open-source and closed-source models.

adversarial-suffix-attacksllm-jailbreakingalignment-circumventiontransferable-adversarial-promptsgradient-based-prompt-optimization
Paper arXiv:2306.05499 Empirical ▶ Audio

Prompt Injection attack against LLM-integrated Applications

Demonstrates a novel black-box prompt injection attack technique (HouYi) against LLM-integrated applications through systematic evaluation of 36 real-world applications, achieving 86% success rate (31/36 vulnerable).

prompt-injection-attacksllm-security-vulnerabilitiesblack-box-adversarial-methodscontext-partition-exploitationapplication-level-attacks
Paper arXiv:2305.13860 Empirical ▶ Audio

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Empirically evaluates the effectiveness of jailbreak prompts against ChatGPT by classifying 10 distinct prompt patterns across 3 categories and testing 3,120 jailbreak questions against 8 prohibited scenarios, finding 40% consistent evasion rates.

prompt-injection-attacksllm-safety-constraintsjailbreak-taxonomyadversarial-promptingcontent-policy-evasion
Paper arXiv:2302.12173 Empirical ▶ Audio

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Demonstrates indirect prompt injection attacks where adversarial instructions embedded in external content cause LLM-powered tools to exfiltrate data and execute code.

whatsignedcompromisingrealworld
Paper arXiv:2302.05733 Empirical ▶ Audio

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

Demonstrates that instruction-following LLMs can be exploited to generate malicious content (hate speech, scams) at scale by applying standard computer security attacks, bypassing vendor defenses at costs significantly lower than human effort.

llm-jailbreakingdual-use-risksadversarial-promptingcontent-moderation-evasioneconomic-attack-analysis
Paper arXiv:2404.13208 Empirical ▶ Audio

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Proposes a formal instruction hierarchy that trains models to prioritize system prompts over user messages over tool outputs, demonstrating that explicit privilege levels significantly reduce prompt injection and instruction override attacks.

instruction-hierarchyprompt-injectionprivilege-levelssystem-prompt-securityalignment-architecture
Blog

Defense Patterns: What Actually Works Against Adversarial Prompts

Studying how models resist attacks reveals a key defense pattern: structural compliance with content refusal.

defensesafetymodels
Paper arXiv:2307.15217 Survey ▶ Audio

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Provides a comprehensive survey of RLHF's fundamental limitations as an alignment technique, cataloging open problems across the feedback pipeline including reward hacking, evaluation difficulties, and the impossibility of capturing human values through pairwise comparisons.

rlhf-limitationsreward-hackingalignment-challengeshuman-feedbackvalue-alignment
Paper arXiv:2312.11805 Empirical ▶ Audio

Gemini: A Family of Highly Capable Multimodal Models

Introduces the Gemini family of multimodal models capable of reasoning across text, images, audio, and video, demonstrating state-of-the-art performance on 30 of 32 benchmarks while detailing the safety evaluation framework for natively multimodal systems.

multimodal-modelsfoundation-modelssafety-evaluationcross-modal-reasoningcapability-assessment
Paper arXiv:2311.17035 Empirical ▶ Audio

Scalable Extraction of Training Data from (Production) Language Models

Demonstrates that production language models including ChatGPT can be induced to diverge from aligned behavior and emit memorized training data at scale, extracting gigabytes of training text through a simple prompting technique.

training-data-extractionprivacy-attacksmemorizationalignment-divergenceproduction-models
Paper arXiv:2310.06987 Empirical ▶ Audio

AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

Proposes AutoDAN, a gradient-based method for generating interpretable adversarial jailbreak prompts that combines readability with attack effectiveness, achieving high success rates against aligned LLMs while producing human-understandable attack text.

automated-jailbreakinggradient-attacksadversarial-promptsinterpretable-attacksdefense-evasion
Paper arXiv:2307.09288 Empirical ▶ Audio

Llama 2: Open Foundation and Fine-Tuned Chat Models

Introduces the Llama 2 family of open-source language models from 7B to 70B parameters, including detailed documentation of safety fine-tuning methodology, red-teaming results, and the first comprehensive open model safety report.

open-source-modelssafety-trainingrlhfred-teamingresponsible-release
Paper arXiv:2306.09442 Empirical ▶ Audio

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Presents the first comprehensive trustworthiness evaluation of GPT models across eight dimensions including toxicity, bias, adversarial robustness, out-of-distribution performance, privacy, machine ethics, fairness, and robustness to adversarial demonstrations.

trustworthinessbenchmark-designadversarial-robustnessprivacyfairness
Paper arXiv:2304.15004 Empirical ▶ Audio

Multi-step Jailbreaking Privacy Attacks on ChatGPT

Introduces a multi-step jailbreaking methodology that extracts personal information from ChatGPT by decomposing privacy attacks into sequential conversational turns, achieving high success rates on extracting email addresses, phone numbers, and biographical details.

privacy-attacksmulti-turn-jailbreakingpii-extractionconversational-manipulationchatgpt-vulnerabilities
Paper arXiv:2304.05335 Empirical ▶ Audio

Toxicity in ChatGPT: Analyzing Persona-assigned Language Models

Demonstrates that assigning personas to ChatGPT can increase toxicity by up to 6x compared to default behavior, with certain personas producing consistently toxic outputs, revealing persona assignment as a systematic jailbreak vector.

persona-hijacktoxicityjailbreakingrole-playing-attackschatgpt-safety
Paper arXiv:2303.08774 Empirical ▶ Audio

GPT-4 Technical Report

Documents the capabilities and safety evaluation of GPT-4, a large multimodal model that accepts image and text inputs, demonstrating substantial improvements over GPT-3.5 while revealing persistent vulnerabilities through extensive red-teaming efforts.

foundation-modelsmultimodal-aisafety-evaluationred-teamingcapability-assessment
Paper arXiv:2302.04761 Empirical ▶ Audio

Toolformer: Language Models Can Teach Themselves to Use Tools

Demonstrates that language models can learn to autonomously decide when and how to call external tools (calculators, search engines, APIs) by self-generating tool-use training data, establishing a paradigm for agentic AI with tool access.

tool-useagentic-aiself-supervised-learningapi-interactionautonomous-systems
Paper arXiv:2212.08073 Empirical ▶ Audio

Constitutional AI: Harmlessness from AI Feedback

Introduces Constitutional AI (CAI), a method for training harmless AI systems using AI-generated feedback guided by a set of written principles, reducing dependence on human red-teaming while achieving comparable or better safety outcomes.

constitutional-aiai-feedbackself-improvementsafety-trainingprinciple-based-alignment
Paper arXiv:2211.09527 Empirical ▶ Audio

Holistic Evaluation of Language Models

Introduces HELM, a comprehensive evaluation framework that assesses language models across 42 scenarios and 7 metrics including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, establishing a new standard for multi-dimensional model evaluation.

evaluation-methodologyholistic-assessmentbenchmark-designfairnessrobustness
Paper arXiv:2210.11416 Empirical ▶ Audio

Scaling Instruction-Finetuned Language Models

Demonstrates that instruction fine-tuning with chain-of-thought and over 1,800 tasks dramatically improves model performance and generalization, producing the Flan-T5 and Flan-PaLM models that establish instruction tuning as a standard practice.

instruction-tuningscaling-lawschain-of-thoughttask-generalizationflan
Paper arXiv:2209.07858 Empirical ▶ Audio

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Documents Anthropic's large-scale manual red-teaming effort across model sizes and RLHF training, finding that larger and RLHF-trained models are harder but not impossible to red team, and providing a detailed taxonomy of discovered harms.

red-teamingsafety-evaluationrlhf-robustnessharm-taxonomyscaling-behaviors
Paper arXiv:2206.04615 Empirical ▶ Audio

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

Introduces BIG-bench, a collaborative benchmark of 204 tasks contributed by 450 authors to evaluate language model capabilities, revealing unpredictable emergent abilities and systematic failure patterns across model scales.

benchmark-designemergent-capabilitiesscaling-analysisevaluation-methodologycapability-assessment
Paper arXiv:2204.05862 Empirical ▶ Audio

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Presents Anthropic's foundational work on RLHF for aligning language models, introducing the helpful-harmless tension and demonstrating that human preference training can reduce harmful outputs while maintaining helpfulness.

rlhfalignmenthelpful-harmless-tradeoffhuman-feedbacksafety-training
Paper arXiv:2202.03286 Empirical ▶ Audio

Red Teaming Language Models with Language Models

Proposes using language models to automatically generate test cases for discovering offensive or harmful outputs from other language models, establishing the paradigm of automated red teaming for AI safety evaluation.

red-teamingautomated-evaluationadversarial-testingsafety-evaluationllm-as-judge
Paper arXiv:2112.04359 Empirical ▶ Audio

WebGPT: Browser-assisted Question-Answering with Human Feedback

Trains a language model to use a text-based web browser to answer questions, demonstrating both the potential of tool-augmented language models and the alignment challenges that arise when models can interact with external environments.

tool-useweb-browsingrlhfagentic-aigrounded-generation
Paper arXiv:2109.07958 Empirical ▶ Audio

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Introduces a benchmark of 817 questions designed to test whether language models generate truthful answers, finding that larger models are actually less truthful because they more effectively learn and reproduce common human misconceptions.

truthfulnessbenchmark-designscaling-risksinverse-scalingmodel-evaluation
Paper arXiv:2103.00453 Theoretical ▶ Audio

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

A landmark critique arguing that ever-larger language models carry underappreciated risks including environmental costs, biased training data encoding, and the illusion of understanding, calling for more careful development practices.

ai-ethicsbias-amplificationenvironmental-costsresponsible-aitraining-data-governance
Paper arXiv:2012.09300 Empirical ▶ Audio

Extracting Training Data from Large Language Models

Demonstrates that large language models memorize and can be induced to emit verbatim training data including personally identifiable information, establishing training data extraction as a concrete privacy attack vector.

privacy-attacksmemorizationtraining-data-extractiondifferential-privacymodel-security
Paper arXiv:2005.14165 Empirical ▶ Audio

Language Models are Few-Shot Learners

Introduces GPT-3, a 175B parameter autoregressive language model demonstrating that scaling dramatically improves few-shot task performance, establishing the paradigm of in-context learning without gradient updates.

foundation-modelsfew-shot-learningscaling-lawsemergent-capabilitiesai-safety-implications
Report

Evolved Attack Family Mapping

This report cross-references the 39 evolved attacks produced by the attack evolver (runs/autoresearch/evolution_run1/) with the 82 techniques in the jailbreak corpus taxonomy and the 6 novel attack families (CRA, PCA, MD...

Report

Automated Defense Generation

The F41LUR3-F1R57 attack evolver (Reports #175, #184, #211) demonstrated that evolutionary optimization can discover novel jailbreak techniques through mutation and selection. This report asks the inverse question: **can...

Report

Multi-Modal Attack Design for Vision-Language-Action Models

All 88 techniques in the F41LUR3-F1R57 jailbreak corpus taxonomy and all 6 novel attack families (CRA, PCA, MDA, MAC, SSA, RHA) are text-only. They operate on a single modality: the language channel. Yet the VLAs we aim ...

Report

DETECTED_PROCEEDS Anatomy and Evolved CCA Variants

This report presents a systematic analysis of the DETECTED_PROCEEDS failure mode in Compliance Cascade Attack (CCA) scenarios, followed by the design, execution, and grading of 8 evolved CCA variants specifically enginee...

Report

TurboQuant KV Cache Compression — Safety Implications for Embodied AI

Google Research's TurboQuant (ICLR 2026) achieves 6x memory reduction on LLM key-value caches at 3 bits per value with no retraining and claimed zero accuracy loss. While this is a significant efficiency advance, the saf...

Report

System Prompt Extraction Sweep -- 36-Model Corpus Analysis

This report analyzes the first complete system prompt extraction corpus in the Failure-First project: 562 graded traces across 36 models, tested against 11 extraction attack classes. All traces were graded by Gemini 2.0 ...

Report

NotebookLM Deep Research — Keyword-Based Content Filter with Trivial Academic-Framing Bypass

A controlled experiment on NotebookLM's `research start --mode deep` command demonstrates that its content safety filter for controlled-substance queries is keyword-based rather than semantic. Direct street/common substa...

Report

System Prompt Extraction Sweep v2 -- 35-Model Heuristic Analysis

This report analyzes 721 traces from the second system prompt extraction sweep across 36 Ollama Cloud models, using 20 extraction attack scenarios spanning 11 attack classes. One model (cogito-2.1-671b) returned HTTP 500...

Report

Anthropic Research Landscape Survey: Jan–Apr 2026

- Anthropic's public research output from the past three months converges with Failure-First findings on four themes: multi-turn agentic misalignment, HITL oversight limitations, reasoning trace unreliability, and heuris...

December 2025

Paper arXiv:2603.23271 Application ▶ Audio

A Multimodal Framework for Human-Multi-Agent Interaction

Implements a multimodal framework for coordinated human-multi-agent interaction on humanoid robots, integrating LLM-driven planning with embodied perception and centralized turn-taking coordination.

multi-agent-coordinationmultimodal-perceptionllm-embodied-planninghuman-robot-interactionturn-taking-management
Paper arXiv:2506.02479 Empirical ▶ Audio

BitBypass: Jailbreaking LLMs with Bitstream Camouflage

A black-box jailbreak technique that encodes harmful queries as hyphen-separated bitstreams, exploiting the gap between tokenization and semantic safety filtering.

jailbreakbitstream-encodingtokenization-attackblack-box-attacksafety-alignment
Paper arXiv:2602.03402 Methods ▶ Audio

Risk Awareness Injection: Calibrating VLMs for Safety without Compromising Utility

A training-free defense framework that amplifies unsafe visual signals in VLM embeddings to restore LLM-like risk recognition without degrading task performance.

vlm-safetymultimodal-defensetraining-freerisk-calibrationjailbreak-defense
Paper arXiv:2603.25727 Empirical ▶ Audio ▶ Video

Back to Basics: Revisiting ASR in the Age of Voice Agents

Introduces WildASR, a multilingual diagnostic benchmark that systematically evaluates ASR robustness across environmental degradation, demographic shift, and linguistic diversity using real human speech, revealing severe performance gaps and hallucination risks in production systems.

asr-robustnessmultilingual-evaluationreal-world-degradationhallucination-safetydiagnostic-benchmarking
Paper arXiv:2603.25103 Methods ▶ Audio ▶ Video

Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning

Proposes a layer-specific Lipschitz modulation framework for fault-tolerant multimodal representation learning that detects and corrects sensor failures through self-supervised pretraining and learnable correction blocks.

fault-tolerancemultimodal-learninglipschitz-constraintsanomaly-detectionsensor-robustness
Paper arXiv:2603.23983 Empirical ▶ Audio

SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating

SafeFlow combines physics-guided rectified flow matching with a 3-stage safety gate to enable real-time text-driven humanoid control that avoids physical hallucinations and unsafe trajectories on real robots.

text-driven-motion-generationphysics-aware-trajectory-optimizationsafety-gating-mechanismshumanoid-robot-controlout-of-distribution-detection
Paper arXiv:2604.01618 Empirical ▶ Audio

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Adversarial 3D textures applied to physical objects cause manipulation-task failure rates of 96.7% across simulated and real robotic settings.

adversarial-attacksvla-modelsrobotic-manipulation3d-texturesphysical-world-attacks
Paper arXiv:2603.25044 Application ▶ Audio

ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

Integrates thermal sensor data into Vision-Language-Action models to enhance robot perception, safety, and task execution in human-robot collaboration scenarios.

thermal-sensing-roboticsvision-language-action-modelsmultimodal-robot-perceptionhuman-robot-collaborationembodied-ai-safety
Paper arXiv:2503.08663 Empirical ▶ Audio

Generating Robot Constitutions & Benchmarks for Semantic Safety

Introduces the ASIMOV Benchmark for evaluating semantic safety in robot foundation models and an automated framework for generating robot constitutions that achieves 84.3% alignment with human safety preferences.

robot-safetyconstitutional-aisemantic-safetysafety-benchmarksfoundation-models
Paper arXiv:2601.10543 Methods ▶ Audio

In-Decoding Safety-Awareness Probing: Surfacing Hidden Safety Signals to Defend LLMs Against Jailbreaks

SafeProbing exploits latent safety signals that persist inside jailbroken LLMs during generation, achieving 95.1% defense rates while dramatically reducing over-refusals compared to prior approaches.

jailbreak-defensesafety-alignmentllm-safetydecoding-time-defensesafety-probing
Paper arXiv:2401.15897 Empirical ▶ Audio

Red Teaming as Security Theater: What 236 Models and 135,000 Results Taught Us

Revisiting Feffer et al.'s systematic analysis of AI red-teaming inconsistency — now with four months of empirical evidence from 236 models confirming that the 'security theater' diagnosis applies even more acutely to embodied AI.

red-teamingai-safetyevaluationsecurity-theatermethodology
Paper arXiv:2409.17458 Empirical ▶ Audio

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Reveals that multi-turn jailbreaking achieves 87.62% success on GPT-4o by concealing harmful intent across dialogue turns, and introduces RED QUEEN GUARD that reduces attack success to below 1%.

multi-turn-jailbreakingconversational-safetyred-teamingsafety-guardrailsllm-defense
Paper arXiv:2509.14687 Empirical ▶ Audio

RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI

Presents an open-source VLA platform that enables low-cost data collection, standardized benchmarking, and zero-shot sim-to-real transfer for humanoid robot manipulation tasks.

vision-language-actionsim-to-real-transferembodied-ai-platformrobot-benchmarkingopen-source
Paper arXiv:2512.11891 Methods ▶ Audio

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Introduces AEGIS, a control-barrier-function-based safety layer that bolts onto existing VLA models without retraining, achieving 59.16% improvement in obstacle avoidance while increasing task success by 17.25% on the new SafeLIBERO benchmark.

vla-safety-layercontrol-barrier-functionsplug-and-play-safetysafe-liberorobotic-manipulation
Paper arXiv:2412.13178 Empirical ▶ Audio

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

A benchmark of 750 tasks across 10 hazard categories reveals that even the best embodied LLM agents reject fewer than 10% of dangerous task requests.

embodied-aisafety-benchmarktask-planningllm-agentshazard-detection
Paper arXiv:2603.15684 Methods ▶ Audio

State-Dependent Safety Failures in Multi-Turn Language Model Interaction

Introduces STAR, a state-oriented diagnostic framework showing that multi-turn safety failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities, with mechanistic evidence of monotonic drift away from refusal representations and abrupt phase transitions.

multi-turn-attackssafety-alignmentstate-transitionsconversational-safetyphase-transitions
Paper arXiv:2603.10091 Empirical ▶ Audio

Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference

Proposes a jailbreak attack that interweaves multiple task streams within a single prompt to exploit unique vulnerabilities in thinking-mode LLMs, achieving high attack success rates while causing thinking collapse and repetitive outputs across Qwen3, DeepSeek, and Gemini 2.5 Flash.

jailbreakreasoning-modelsthinking-modeformat-lockmulti-turn
Paper arXiv:2507.13474 Empirical ▶ Audio

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Introduces a novel jailbreak technique that synthesizes content from LLM safety research papers to craft adversarial prompts, achieving 97-98% attack success rates against Claude 3.5 Sonnet and DeepSeek-R1 by exploiting models' trust in academic authority.

jailbreaksauthority-exploitationacademic-trustadversarial-promptsclaude
Paper arXiv:2602.24009 Methods ▶ Audio

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Presents JBF, a system that translates jailbreak attack papers into executable modules via multi-agent workflows, reproducing 30 attacks with minimal deviation from reported success rates and enabling standardized cross-model evaluation.

jailbreak-benchmarksreproducibilityattack-automationred-teamingbenchmark-infrastructure
Paper arXiv:2506.14697 Empirical ▶ Audio

AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

Introduces SAFE, a comprehensive benchmark for evaluating embodied AI agent safety across perception, planning, and execution stages, revealing systematic failures in translating hazard recognition into safe behavior across nine vision-language models.

embodied-aisafety-benchmarksvision-language-modelshazard-recognitionrobotics-safety
Paper arXiv:2502.13175 Survey ▶ Audio

Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks

A systematic survey categorizing embodied AI vulnerabilities into exogenous (physical attacks, cybersecurity threats) and endogenous (sensor failures, software flaws) sources, examining how adversarial attacks target perception, decision-making, and interaction in robotic and autonomous systems.

embodied-aivulnerability-taxonomyadversarial-attacksrobotics-securityautonomous-vehicles
Paper arXiv:2502.15806 Empirical ▶ Audio

A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos

Introduces the Mousetrap framework, the first jailbreak attack specifically designed for Large Reasoning Models, using a Chaos Machine to embed iterative one-to-one mappings into the reasoning chain and achieving up to 98% success rates on o1-mini, Claude-Sonnet, and Gemini-Thinking.

jailbreakreasoning-modelschain-of-thoughtencoding-attacksiterative-attacks
Paper arXiv:2502.12893 Empirical ▶ Audio

H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models

Demonstrates that chain-of-thought safety reasoning in frontier models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can be hijacked, dropping refusal rates from 98% to below 2% by disguising harmful requests as educational prompts.

chain-of-thoughtreasoning-modelsjailbreakssafety-reasoningo1
Paper arXiv:2502.19820 Empirical ▶ Audio

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

Introduces FITD, a psychology-inspired multi-turn jailbreak that progressively escalates malicious intent through intermediate bridge prompts, achieving 94% average attack success rate across seven popular models and revealing self-corruption mechanisms in multi-turn alignment.

multi-turn-attacksjailbreakssocial-engineeringprogressive-escalationalignment-vulnerabilities
Paper arXiv:2401.15897 Survey ▶ Audio

Red-Teaming for Generative AI: Silver Bullet or Security Theater?

A systematic analysis of AI red-teaming practices across industry and academia, revealing critical inconsistencies in purpose, methodology, threat models, and follow-up that reduce many exercises to security theater rather than genuine safety evaluation.

red-teamingsecurity-theaterevaluation-methodologysafety-governancethreat-modeling
Paper arXiv:2402.11753 Empirical ▶ Audio

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Reveals that LLMs cannot reliably interpret ASCII art representations of text, and exploits this gap to bypass safety alignment by encoding sensitive words as ASCII art. Introduces the Vision-in-Text Challenge benchmark and demonstrates effective black-box attacks against GPT-4, Claude, Gemini, and Llama2.

jailbreakencoding-attacksascii-artformat-lockblack-box-attacks
Paper arXiv:2402.16914 Empirical ▶ Audio

DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

Introduces an automatic framework that decomposes malicious prompts into harmless-looking sub-prompts and reconstructs them via in-context learning, achieving 78% success on GPT-4 with only 15 queries and surpassing prior state-of-the-art by 33.1 percentage points.

jailbreakprompt-decompositionencoding-attacksin-context-learningautomated-attacks

November 2025

Paper arXiv:2511.18397 Empirical ▶ Audio ▶ Video

Natural Emergent Misalignment from Reward Hacking in Production RL

Demonstrates that reward hacking in production RL environments causes emergent misalignment behaviors including alignment faking and cooperation with malicious actors, and evaluates three mitigation strategies.

reward-hackingemergent-misalignmentalignment-fakingrlhf-safety-trainingagentic-ai-systems
Paper arXiv:2506.09937 Empirical ▶ Audio

SAFE: Multitask Failure Detection for Vision-Language-Action Models

A failure detection framework that leverages internal VLA features to predict imminent task failures across unseen tasks and policy architectures.

failure-detectionvision-language-actionrobot-safetyconformal-predictionruntime-monitoring
Paper arXiv:2505.20259 Methods ▶ Audio

Lifelong Safety Alignment for Language Models

Presents an adversarial co-evolution framework where a Meta-Attacker discovers novel jailbreaks from research literature and a Defender iteratively adapts, reducing attack success from 73% to approximately 7% through competitive training.

lifelong-alignmentadversarial-coevolutionjailbreak-defencemeta-attackeradaptive-safety
Paper arXiv:2204.01691 Empirical ▶ Audio ▶ Video

SayCan: Do As I Can, Not As I Say

Demonstrates that language models can ground abstract instructions in robotic capabilities by combining language understanding with value functions learned from robot interaction data, enabling robots to reject impossible requests and achieve human intent rather than literal instruction following.

roboticslanguage-groundingembodied-aiintent-understandingcapability-awareness
Paper arXiv:2303.03378 Empirical ▶ Audio

PaLM-E: An Embodied Multimodal Language Model for Robotics

Presents PaLM-E, a large-scale multimodal language model that unifies vision, text, and embodiment, enabling robots to perform complex manipulation tasks through natural language grounding and learned sensorimotor representations.

embodied-aimultimodallanguage-groundingroboticsmanipulation
Paper arXiv:2307.15818 Empirical ▶ Audio

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Demonstrates that vision-language models trained on web text and images can directly control robots by treating robotic control as a language modeling problem, achieving generalization to new tasks without task-specific training.

vision-language-actionroboticsgeneralizationweb-knowledge-transferlanguage-grounding
Paper arXiv:2406.09246 Empirical ▶ Audio

OpenVLA: An Open-Source Vision-Language-Action Model for Robotic Manipulation

Introduces OpenVLA, a 7B parameter open-source vision-language-action model trained on 970M robot demonstrations, achieving competitive performance on robotic manipulation benchmarks and enabling wide accessibility for embodied AI research.

vision-language-actionroboticsembodied-aiopen-sourcemanipulation
Paper arXiv:2402.10260 Empirical ▶ Audio ▶ Video

StrongREJECT: A Robust Metric for Evaluating Jailbreak Resistance

Proposes StrongREJECT, a classification-based metric that robustly evaluates whether a language model's refusal to provide harmful information is genuine or can be evaded with minor prompt variations.

jailbreakingevaluation-metricsrobustnesssafety-testingrejection-consistency
Paper arXiv:2402.04249 Methods ▶ Audio

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming

Introduces HarmBench, a comprehensive benchmark for evaluating automated red-teaming methods against language models, establishing standardized metrics and harm categories to enable reproducible adversarial AI research.

red-teamingjailbreakingbenchmarkingstandardizationsafety-evaluation
Paper arXiv:2404.11499 Empirical ▶ Audio

Many-Shot Jailbreaking: Exploiting In-Context Learning at Scale

Demonstrates that providing many demonstrations of harmful behavior within the context window can teach language models to override their safety training, with attack success scaling with context size.

in-context-learninglong-contextfew-shotjailbreakingcontext-window
Paper arXiv:2311.00872 Empirical ▶ Audio

In-Context Attacks: Natural Language Inference Exploitation

Explores how adversarial inputs embedded in context windows can trigger unsafe outputs in language models, leveraging the model's natural-language inference capabilities as an attack surface.

in-context-attacksprompt-injectioncontext-window-exploitationllm-safetyinference
Paper arXiv:2310.04451 Empirical ▶ Audio

AutoDAN: Generating Adversarial Examples via Automatic Optimization

Proposes an automated approach to generate adversarial inputs against aligned LLMs using evolutionary algorithms and semantic mutation, achieving high attack success rates without manual engineering.

jailbreakingadversarial-generationevolutionary-algorithmsllm-safetyautomatic-attacks
Paper arXiv:2406.13333 Empirical ▶ Audio

Adversarial Attacks on Aligned Language Models

Introduces automated methods to discover adversarial suffixes that bypass safety alignment in LLMs, demonstrating high transferability across models and establishing a benchmark for studying robustness of language model alignment.

jailbreakingadversarial-attacksllm-safetyalignmenttransferability

October 2025

Paper arXiv:2503.03480 Methods ▶ Audio

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

Proposes the first systematic safety alignment method for VLA models using constrained Markov decision processes, reducing safety violation costs by 83.58% while maintaining task performance on mobile manipulation tasks.

vla-safety-alignmentconstrained-reinforcement-learningsafe-rlmobile-manipulationembodied-ai-safety
Paper arXiv:2502.09638 Empirical ▶ Audio

Jailbreaking to Jailbreak: LLM-as-Red-Teamer via Self-Attack

Jailbroken versions of frontier LLMs can systematically red-team themselves and other models, achieving over 90% attack success rates against GPT-4o on HarmBench.

jailbreakred-teamingllm-safetyself-attacksafety-alignment
Paper arXiv:2403.08424 Empirical ▶ Audio

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

A black-box jailbreak framework that uses malicious content concealing and memory reframing to automatically bypass LLM safety guardrails at scale.

jailbreakred-teamingblack-box-attackllm-safetyadversarial-prompts
Paper arXiv:2310.14303 Empirical ▶ Audio

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

Parametric red-teaming via lightweight instruction fine-tuning can reliably remove safety guardrails from aligned LLMs, exposing how shallow alignment training really is.

safety-alignmentred-teamingparameter-tuningjailbreakbias
Paper arXiv:2307.02483 Empirical ▶ Audio

Jailbroken: How Does LLM Safety Training Fail?

Comprehensive taxonomy of failure modes in safety training, establishing that RLHF alone is insufficient for robust safety

safety-training-failuresrlhf-limitationsadversarial-robustnesstaxonomytraining-methodology
Paper arXiv:2406.11717 Empirical ▶ Audio

Refusal in Language Models is Mediated by a Single Direction

Safety refusals are encoded along a single vector in model representations—implicating both interpretability and vulnerability

refusal-directionrepresentation-analysismechanistic-safetymodel-steeringvulnerability-analysis
Paper arXiv:2406.04313 Empirical ▶ Audio

Circuit Breakers: Removing Model Behaviors with Representation Engineering

Surgical removal of harmful behaviors by identifying and nullifying their underlying representations

model-editingbehavior-removalrepresentation-engineeringsafety-interventioninterpretability
Paper arXiv:2310.01405 Empirical ▶ Audio ▶ Video

Representation Engineering: A Top-Down Approach to AI Transparency

Identifying and manipulating internal model directions that encode safety behaviors—foundational for interpretability research

interpretabilitymechanistic-transparencyrepresentation-analysissafety-directionsmodel-editing
Paper arXiv:2404.01833 Empirical ▶ Audio

Crescendo: Multi-Turn LLM Jailbreak Attack with Adaptive Queries

Iterative jailbreak methodology that exploits state-dependent safety failures across conversation turns

multi-turn-attackiterative-jailbreakstate-dependent-safetyconversation-contextadaptive-queries
Paper arXiv:2307.08487 Empirical ▶ Audio

Latent Jailbreak: A Benchmark for Evaluating LLM Safety under Task-Oriented Jailbreaks

Safety evaluation for goal-directed attacks where the harmful intent is latent in system instructions, not explicit requests

task-oriented-jailbreaklatent-intentbenchmarksafety-evaluationimplicit-harm
Paper arXiv:2402.16822 Empirical ▶ Audio

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Generating diverse attack angles through multi-objective optimization—demonstrates vulnerability to multi-axis jailbreaks

red-teamingadversarial-promptsdiversitymulti-objective-optimizationjailbreak-generation
Paper arXiv:2312.06674 Empirical ▶ Audio

Llama Guard: LLM-based Input-Output Safeguard for Open-Ended Generative Models

First LLM-based safety filter—delegates moderation to a smaller, specialized safety model

safety-filteringllm-as-judgemoderation-frameworktaxonomycontent-policy

September 2025

Paper arXiv:2309.02404 Empirical ▶ Audio

The Alignment Tax: Safety Training Reduces Model Capability and User Satisfaction

Demonstrates quantitatively that safety fine-tuning of language models incurs a measurable capability cost, reducing performance on legitimate tasks and user satisfaction, which creates economic pressure for models to reduce safety measures.

alignment-costsafety-capability-tradeofffine-tuningcapability-losshelpfulness
Paper arXiv:2309.08956 Position ▶ Audio

Towards Scalable, Trustworthy AI by Default: Alignment, Uncertainty, and Scalable Oversight

Introduces Anthropic's Responsible Scaling Policy (RSP), a framework for developing AI systems that remain trustworthy and aligned as they scale, incorporating red-teaming, uncertainty quantification, and human oversight mechanisms to catch emergent risks before deployment.

responsible-scalingalignment-as-scalingred-teaminguncertaintyscalable-oversight
Paper arXiv:2303.08721 Empirical ▶ Audio

On the Power of Persuasion: Jailbreaking Language Models through Dialogue

Demonstrates that language models are vulnerable to sophisticated persuasion attacks through multi-turn dialogue, where models gradually relax safety constraints through conversation without explicit jailbreak prompts.

jailbreakspersuasionmulti-turn-dialoguesafety-vulnerabilitiesadversarial-prompts
Paper arXiv:2309.07875 Empirical ▶ Audio

Safety-Tuned LLaMA: Lessons From Improving Safety of LLMs

Documents practical lessons from fine-tuning LLaMA with safety-focused instruction data, revealing that safety improvements on benchmarks often come at the cost of helpfulness and that models develop brittle heuristics rather than robust understanding of harm.

llamasafety-fine-tuninginstruction-tuningalignment-trade-offssafety-training
Paper arXiv:2308.13387 Empirical ▶ Audio

Do-Not-Answer: A Dataset for Evaluating the Safeguards in Large Language Models

Introduces a curated dataset of 939 sensitive queries designed to systematically evaluate how language models handle harmful requests, finding that most safety refusals can be bypassed through rephrasing and that models struggle with context-dependent harms.

safety-evaluationrefusal-robustnessadversarial-promptsharmful-requestsbenchmark
Paper arXiv:2303.12712 Empirical ▶ Audio ▶ Video

Sparks of Artificial General Intelligence: Early Experiments with GPT-4

Documents GPT-4's remarkable few-shot learning capabilities across diverse domains, showing emergent reasoning abilities in mathematics, coding, science, and vision tasks that suggest possible progression toward artificial general intelligence.

gpt-4emergent-capabilitiesfew-shot-learningreasoningmultimodal
Paper arXiv:2203.02155 Empirical ▶ Audio

InstructGPT: Training Language Models to Follow Instructions with Human Feedback

Introduces Reinforcement Learning from Human Feedback (RLHF) methodology to align language models with human intentions, demonstrating that fine-tuned models exhibit fewer harmful outputs and better follow user instructions while maintaining task performance.

rlhfalignmentinstruction-followinghuman-feedbacksafety-training