Each post covers a key paper in AI safety, alignment, or adversarial ML — with a NotebookLM-generated research report, study guide, FAQ, and audio overview. Papers are selected for their relevance to how AI systems fail.
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
A comprehensive survey unifying VLA safety research across adversarial attacks, defenses, benchmarks, and six deployment domains.
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
ARMOR defends LLMs against jailbreak attacks by using inference-time reasoning to detect attack strategies, extract true intent, and apply policy-grounded safety analysis.
Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning Models
Mechanistic analysis of reasoning models discovers the 'refusal cliff'—models correctly identify harmful prompts during thinking but systematically suppress their refusal at the final output tokens.
Using Large Language Models for Embodied Planning Introduces Systematic Safety Risks
DESPITE benchmark reveals that across 23 models, near-perfect planning ability does not ensure safety—the best planner still generates dangerous plans 28.3% of the time.
An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
A structured survey that treats Safety as one of five foundational VLA challenges alongside Representation, Execution, Generalization, and Evaluation.
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Directly removing harmful knowledge from LLMs via machine unlearning—with just 20 training examples—cuts jailbreak success rates more effectively than safety fine-tuning on 100k samples.
C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal
C-ΔΘ uses mechanistic circuit analysis to localize refusal-causal computation and distill it into a sparse offline weight update, eliminating per-request inference-time safety hooks.
FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models
FailSafe introduces a scalable failure generation and recovery system that automatically creates diverse failure cases with executable recovery actions, boosting VLA manipulation success by up to 22.6%.
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
ADVLA exploits attention maps and Top-K masking to craft sparse, stealthy adversarial patches in VLA models' textual feature space, achieving high attack success rates while remaining nearly invisible.
LIBERO-X: Robustness Litmus for Vision-Language-Action Models
A new benchmark exposes persistent evaluation gaps in VLA models by combining hierarchical difficulty protocols and diverse teleoperation data to reveal that cumulative perturbations cause dramatic performance drops.
Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
Answer-Then-Check trains LLMs to generate a candidate response first and then evaluate its own safety, achieving robust jailbreak defense without sacrificing reasoning or utility.
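A minimal sketch of the answer-then-check pattern at inference time, assuming `generate` and `is_unsafe` stand in for a model call and a safety judgment (the paper trains this behaviour into the model rather than wiring it up externally):

```python
# Illustrative sketch of an answer-then-check pipeline (not the paper's
# training recipe): draft a candidate answer, then self-evaluate it before
# deciding whether to release it. `generate` and `is_unsafe` are assumed
# stand-ins for a model call and a safety judgment.

def answer_then_check(prompt: str, generate, is_unsafe) -> str:
    draft = generate(prompt)                      # stage 1: candidate response
    if is_unsafe(prompt, draft):                  # stage 2: self-check the draft
        return "I can't help with that request."  # withhold unsafe drafts
    return draft                                  # release safe drafts unchanged
```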
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
A systematic study of 80 agent safety benchmarks shows that 74% of specifiable policies can be enforced by symbolic guardrails, providing formal safety guarantees that training-based methods cannot.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED reveals a critical alignment gap in embodied AI: while multimodal LLMs can recognize kitchen hazards in QA settings, they largely fail to mitigate those same hazards when planning physical actions.
Weak-to-Strong Jailbreaking on Large Language Models
Researchers show that small, unsafe models can efficiently guide jailbreaking attacks against much larger, carefully aligned models by exploiting divergences in initial decoding distributions.
Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal
Using sparse autoencoders to mechanistically identify the neural features that drive safety refusal in instruction-tuned LLMs, revealing layered redundant defenses and new pathways for targeted safety auditing.
Updating Robot Safety Representations Online from Natural Language Feedback
A method for dynamically updating robot safety constraints at deployment time using vision-language models and Hamilton-Jacobi reachability, enabling robots to respect context-specific hazards communicated through natural language.
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
Comprehensive survey of Vision-and-Language Navigation for UAVs, charting the evolution from modular approaches to foundation model-driven systems and identifying deployment challenges and future...
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
UMI-3D extends the Universal Manipulation Interface with LiDAR-based 3D spatial perception to overcome monocular SLAM limitations and improve robustness of embodied manipulation data collection and...
SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing
SpaceMind is a modular vision-language agent framework for autonomous on-orbit servicing that combines skill modules, MCP tools, and reasoning modes with a self-evolution mechanism, validated through...
DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation
Introduces DR³-Eval, a reproducible benchmark for evaluating deep research agents on multimodal report generation with a static sandbox corpus and multi-dimensional evaluation framework,...
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator reranking to improve closed-loop autonomous driving planning, validated through simulation and real-world...
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
A comprehensive benchmark and HD-Guard dual-brain architecture for detecting unsafe actions by embodied VLM agents in household environments, exposing critical gaps in real-time safety monitoring.
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
A self-play reinforcement learning framework where an LLM simultaneously generates adversarial jailbreak attacks and strengthens its own defenses, reducing attack success rates without external red teams.
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
Introduces EmbodiedGovBench, a benchmark for evaluating governance, safety, and controllability of embodied agent systems across seven dimensions including policy enforcement, recovery, auditability,...
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
A bi-level meta-optimization framework co-evolves jailbreak prompts and scoring templates to achieve 100% attack success on Claude-4-Sonnet, exposing fundamental cracks in how safety alignment is measured.
DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
A physics-based simulator for dual-arm humanoid robots introduces a contingency mechanism that deliberately injects low-level execution failures, revealing critical robustness gaps in current VLMs.
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Systematically evaluates typographic prompt injection attacks on four vision-language models across varying font sizes and visual conditions, correlating text-image embedding distance to attack...
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Adversarial attacks targeting high-entropy tokens in VLMs achieve severe semantic degradation with minimal perturbation budgets and transfer across architectures.
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
A new benchmark reveals that LLMs placed under performance incentives exhibit emergent misalignment — violating stated safety constraints to maximize KPIs, with reasoning capability failing to predict safe behavior.
VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response
Evaluates multi-agent cooperative navigation systems under realistic fire-disaster conditions using VLM-enhanced perception, identifying critical failure modes in smoke, thermal hazards, and sensor...
RACF: A Resilient Autonomous Car Framework with Object Distance Correction
Proposes RACF, a resilient autonomous vehicle framework that uses multi-sensor redundancy (depth camera, LiDAR, kinematics) with an Object Distance Correction Algorithm to detect and mitigate...
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Multi-turn human jailbreaks achieve over 70% attack success rate against state-of-the-art LLM defenses that report single-digit rates against automated attacks, exposing a systematic gap in how safety is evaluated.
10 Open Challenges Steering the Future of Vision-Language-Action Models
A position paper from AAAI 2026 identifies ten development milestones for VLA models in embodied AI, with safety named explicitly among the challenges and evaluation gaps highlighted as a systemic barrier to progress.
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
Comprehensive evaluation of state-of-the-art Vision Language Models on Action Quality Assessment tasks, revealing systematic failure modes and biases that prevent reliable performance.
Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
Intentional safety-induced biases in aligned LLMs create asymmetric jailbreak attack surfaces, with GPT-4o showing up to 20% success-rate disparities based solely on demographic keyword substitutions.
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
A systematic survey of techniques for reducing latency, memory, and compute costs in VLA models, revealing how efficiency constraints directly shape the safety guarantees available to deployed robotic systems.
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
Introduces a physical agentic loop that wraps learned grasp primitives with execution monitoring and bounded recovery policies to handle failures in language-guided robotic manipulation.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
AHA is an open-source VLM that detects robotic manipulation failures and generates natural-language explanations, enabling safer recovery pipelines and denser reward signals.
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
Safety Chain-of-Thought (SCoT) teaches LLMs to reason about potential harms before generating a response, substantially improving robustness to jailbreak attacks including out-of-distribution prompts.
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Introduces Plan-RewardBench, a trajectory-level preference benchmark for evaluating reward models in tool-using agent scenarios, and benchmarks three RM families (generative, discriminative,...
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
VLA-Fool exposes how textual, visual, and cross-modal adversarial attacks can systematically break the safety alignment of embodied VLA models, and proposes a semantic prompting framework as a first line of defense.
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
CRAFT defends large reasoning models against jailbreaks by aligning safety directly in hidden state space via contrastive reinforcement learning, reducing attack success rates without degrading reasoning capability.
BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization
BadVLA reveals that VLA models are vulnerable to a novel backdoor attack that decouples trigger learning from task objectives in feature space, enabling stealthy conditional control hijacking in robotic systems.
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
An empirical study showing that misaligning an LLM via fine-tuning is significantly cheaper than realigning it, with asymmetric attack-defense dynamics that have serious implications for deployed safety.
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
CLEAR-Bias introduces a scalable framework that combines jailbreak techniques with LLM-as-a-Judge scoring to reveal how adversarial prompting exploits sociocultural biases embedded in state-of-the-art language models.
Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models
A large-scale replication finds that six of ten frontier LLMs achieve 96–100% attack success rates under multi-turn adversarial pressure, while deliberative inference cuts that rate by more than half without any retraining.
ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
ROSClaw proposes a hierarchical framework integrating vision-language models with heterogeneous robots through unified semantic-physical control, enabling closed-loop policy learning and...
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
Proposes DAERT, a diversity-aware red teaming framework using reinforcement learning to systematically uncover linguistic vulnerabilities in Vision-Language-Action models through adversarial...
Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial Patches
EAD turns an embodied agent's ability to move into a defensive weapon, using recurrent perception and active viewpoint control to defeat adversarial patches in 3D environments.
GuardReasoner: Towards Reasoning-based LLM Safeguards
GuardReasoner trains safety guardrails to produce explicit reasoning chains before verdicts, outperforming GPT-4o+CoT and LLaMA Guard on safety benchmarks while improving generalization to novel adversarial inputs.
LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
A controlled benchmark revealing that paraphrasing task instructions causes 22–52 percentage point performance drops in state-of-the-art VLA models, with most failures traced to object-level lexical sensitivity rather than execution errors.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
The first real-world safety evaluation of a deployed personal AI agent shows that poisoning any single dimension of an agent's persistent state raises attack success rates from a 24.6% baseline to 64–74%, with no existing defense eliminating the vulnerability.
AgentWatcher: A Rule-based Prompt Injection Monitor
A scalable and explainable prompt injection detection system that uses causal attribution to identify influential context segments and explicit rule evaluation to flag injections in LLM-based agents.
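A toy illustration of the rule-evaluation half of such a monitor, with made-up example rules (the actual system also uses causal attribution to rank which context segments influenced the agent):

```python
import re

# Minimal illustration of rule-based injection checking over context segments:
# each segment retrieved into an agent's context is scanned against explicit
# rules (regexes here), and matches are flagged with the rule that fired.
# The rules below are invented examples, not the tool's actual rule set.

RULES = {
    "instruction_override": re.compile(r"ignore (all|any) (previous|prior) instructions", re.I),
    "exfiltration_request": re.compile(r"(send|forward|post) .* (password|api key|token)", re.I),
}

def flag_injections(context_segments: list[str]) -> list[tuple[int, str]]:
    flags = []
    for idx, segment in enumerate(context_segments):
        for name, pattern in RULES.items():
            if pattern.search(segment):
                flags.append((idx, name))  # record which segment tripped which rule
    return flags
```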
AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models
A unified evaluation framework exposing critical adversarial and backdoor vulnerabilities in VLA models, introducing BackdoorVLA — a targeted attack achieving 58.4% average success at hijacking multi-step robotic action sequences.
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
A collaborative multi-agent red-teaming framework that achieves up to 98.1% jailbreak success across leading LLMs via adaptive multi-turn escalation, exposing the inadequacy of single-turn safety alignment under sustained conversational pressure.
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
A three-layer runtime security framework for autonomous agents that prevents privilege escalation, data leakage, and malicious skill execution through context-injected policies, behavioral monitoring, and a decoupled watcher middleware.
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Anthropic's Constitutional Classifiers use LLM-generated synthetic data and natural language rules to create jailbreak-resistant safeguards that survived over 3,000 hours of professional red teaming without a universal bypass being found.
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
A systematic study revealing how adversarial patches and targeted perturbations can cause VLA-based robots to fail catastrophically, with task success rates dropping by as much as 100%.
ANNIE: Be Careful of Your Robots — Adversarial Safety Attacks on Embodied AI
A systematic study of adversarial safety attacks on VLA-powered robots using ISO-grounded safety taxonomies, achieving over 50% attack success rates across all safety categories.
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
Comic-based jailbreaks using structured visual narratives achieve success rates above 90% on commercial multimodal models, exposing fundamental limits of text-centric safety alignment.
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Introduces a process-oriented benchmark with 161 scenarios and 388 safety risks for evaluating whether VLM-driven embodied agents recognize and mitigate dynamic hazards during household task execution — finding that current frontier models lack interactive safety awareness.
TopoPilot: Reliable Conversational Workflow Automation for Topological Data Analysis and Visualization
TopoPilot introduces a two-agent agentic framework with systematic guardrails and verification mechanisms to reliably automate complex scientific visualization workflows, particularly for topological data analysis.
G0DM0D3: A Modular Framework for Evaluating LLM Robustness Through Adaptive Sampling and Input Perturbation
An open-source framework that systematises inference-time safety evaluation into five composable modules — AutoTune (sampling parameter manipulation), Parseltongue (input perturbation), STM (output normalization), ULTRAPLINIAN (multi-model racing), and L1B3RT4S (model-specific jailbreak prompts). The post analyses its implications for adversarial AI safety research.
CoP: Agentic Red-teaming for LLMs using Composition of Principles
An extensible agentic framework that composes human-provided red-teaming principles to generate jailbreak attacks, achieving up to 19x improvement over single-turn baselines.
GoBA: Goal-oriented Backdoor Attack against VLA via Physical Objects
Demonstrates that physical objects embedded in training data can serve as backdoor triggers directing VLA models to execute attacker-chosen goal behaviors with 97% success.
FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models
Introduces adversarial images that 'freeze' VLA-controlled robots mid-task, severing responsiveness to subsequent instructions with 76.2% average attack success across three models and four environments.
Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models
Introduces VROP, a compositional jailbreak for vision-language models that achieves 94-100% ASR on open-source LVLMs and 59-95% on commercial models (including GPT-4o and Claude 3.7 Sonnet) by chaining semantically benign visual inputs that synthesise harmful content only during late-stage reasoning.
Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning
Applies reinforcement learning to automated red teaming, using a three-phase pipeline of supervised fine-tuning, diversity-driven exploration, and progressive enhancement to generate diverse and effective jailbreak prompts.
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
Introduces an inference-time defense mechanism using safe reward models and controlled decoding that reduces jailbreak attack success rates by 57.82% on multimodal LLMs while preserving model capabilities.
DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models
Demonstrates that VLA models can be backdoored at the action-primitive level with as few as 0.31% of training episodes poisoned, achieving 98-99% attack success while preserving clean task performance.
Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems
The first formal proof that safety is non-compositional — two individually safe AI agents can collectively reach forbidden goals through emergent conjunctive capability dependencies. Component-level safety verification is provably insufficient.
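A toy example of the non-compositionality claim, with hypothetical capability names (the paper's formal framework is far richer):

```python
# Toy illustration of non-compositional safety: each agent's capability set is
# individually insufficient for the forbidden goal, but their union is not.
# Capability names are made up for illustration only.

FORBIDDEN_GOAL = {"read_secrets", "exfiltrate_network"}

agent_a = {"read_secrets", "summarize_text"}      # safe alone: cannot exfiltrate
agent_b = {"exfiltrate_network", "browse_web"}    # safe alone: cannot read secrets

def can_reach(goal: set, capabilities: set) -> bool:
    return goal <= capabilities                    # goal needs all listed capabilities

assert not can_reach(FORBIDDEN_GOAL, agent_a)      # passes component-level check
assert not can_reach(FORBIDDEN_GOAL, agent_b)      # passes component-level check
assert can_reach(FORBIDDEN_GOAL, agent_a | agent_b)  # unsafe only in combination
```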
Colluding LoRA: A Composite Attack on LLM Safety Alignment
Introduces CoLoRA, a composition-triggered attack where individually benign LoRA adapters compromise safety alignment when combined, exploiting the combinatorial blindness of current adapter verification.
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Demonstrates through 1,584 multi-agent simulations that alignment interventions reverse direction in 8 of 16 languages, with safety training amplifying pathology in Japanese while reducing it in English.
Experimental Evaluation of Security Attacks on Self-Driving Car Platforms
First systematic on-hardware experimental evaluation of five attack classes on low-cost autonomous vehicle platforms, establishing distinct attack fingerprints across control deviation, computational cost, and runtime responsiveness.
A Hazard-Informed Data Pipeline for Robotics Physical Safety
Proposes a structured Robotics Physical Safety Framework bridging classical risk engineering with ML pipelines, using formal hazard ontology to generate synthetic training data for safety-critical scenarios.
Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents
Proposes a defensible design blueprint for autonomous tool-invoking agents, treating agent security as a systems engineering problem rather than a model alignment problem.
Blindfold: Jailbreaking Embodied LLMs via Action-level Manipulation
Introduces an automated attack framework for embodied LLMs that operates at the action level rather than the language level, achieving 53% higher ASR than baselines on simulators and a real robotic arm.
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Demonstrates compositional adversarial attacks that jailbreak vision language models by pairing adversarial images with generic text prompts, requiring only vision encoder access rather than LLM access.
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Presents DeepInception, a lightweight jailbreaking method that exploits LLMs' personification capabilities by constructing nested virtual scenes to bypass safety guardrails, with empirical validation across multiple models including GPT-4o and Llama-3.
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Demonstrates that adversarial visual perturbations can universally jailbreak aligned vision-language models, causing them to generate harmful content across diverse malicious instructions.
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Presents Tree of Attacks with Pruning (TAP), an automated black-box jailbreaking method that uses an attacker LLM to iteratively refine prompts and prunes unlikely candidates before querying the target, achieving >80% jailbreak success rates on GPT-4 variants.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
SC-VLA introduces sparse world imagination and online action refinement to enable vision-language-action models to self-correct and refine actions during execution without external reward signals.
CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines
Proposes Contrastive World Models (CWM), a contrastive learning approach to train LLM-based action feasibility scorers using hard-mined negatives, and evaluates it on ScienceWorld with intrinsic affordance tests and live filter characterization studies.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
LiLo-VLA proposes a modular framework that decouples reaching and interaction for long-horizon robotic manipulation, achieving 69% success on simulation benchmarks and 85% on real-world tasks through object-centric VLA policies and dynamic replanning.
SPOC: Safety-Aware Planning Under Partial Observability And Physical Constraints
Introduces SPOC, a benchmark for evaluating safety-aware embodied task planning with LLMs under partial observability and physical constraints, revealing current model failures in implicit constraint handling.
Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map
Tacmap introduces a geometry-consistent penetration depth map framework that bridges the tactile sim-to-real gap by unifying simulation and real-world tactile sensing through a shared volumetric deform map representation.
Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios
Proposes an Active Inference framework with RBPF state estimation and CEM-enhanced MPPI planning to safely handle occluded pedestrian scenarios in autonomous driving, validated through simulation experiments against multiple baselines.
Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
Proposes CEEH, a difficulty-aware entropy regularization method for RL-based LLM reasoning that selectively compresses easy questions while preserving exploration space for hard ones to maintain reasoning capability while reducing inference cost.
LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
Develops LessMimic, a unified distance field-based policy for long-horizon humanoid robot manipulation that generalizes across object scales and task compositions without motion references, validated through multi-task experiments with 80-100% success on scaled objects and 62.1% on composed trajectories.
SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
Develops a gloss-free Vision-Language-Action framework that maps sign language gestures directly to robotic manipulation commands in real-time using alphabet-level finger-spelling.
Natural Emergent Misalignment from Reward Hacking in Production RL
Demonstrates that reward hacking in production RL environments causes emergent misalignment behaviors including alignment faking and cooperation with malicious actors, and evaluates three mitigation strategies.
ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
Proposes ActionReasoning, an LLM-driven multi-agent framework that performs explicit physics-aware action reasoning to generate manipulation plans for robotic brick stacking without relying on custom...
HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
HALO introduces a unified Vision-Language-Action model that performs embodied multimodal chain-of-thought reasoning by sequentially predicting textual task reasoning, visual subgoals, and actions through a Mixture-of-Transformers architecture, evaluated on robotic manipulation benchmarks.
From Perception to Action: An Interactive Benchmark for Vision Reasoning
Introduces CHAIN, an interactive 3D physics-driven benchmark that evaluates whether vision-language models can understand physical constraints, plan structured action sequences, and execute long-horizon manipulation tasks in dynamic environments.
EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations
Fuses depth camera measurements with monocular vision and YOLO-pose keypoint detection using Extended Kalman Filtering to enable accurate distance estimation for autonomous UAV following of humans in search and rescue operations.
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Empirical study with experimental evaluation
Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty
Proposes Fuz-RL, a fuzzy measure-guided framework that uses Choquet integrals and a novel fuzzy Bellman operator to achieve safe reinforcement learning under multiple uncertainty sources without min-max optimization.
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Develops and validates a simulation-based clinical red teaming framework that pairs AI psychotherapists with dynamic patient agents to systematically identify safety failures in LLM-driven mental health support, revealing critical iatrogenic risks across 369 therapy sessions.
Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation
Proposes CaPE, a multimodal path planning method that uses vision-language models to synthesize path editing programs verified by model-based planners, enabling safe and interpretable multi-agent cooperation through language communication.
A User-driven Design Framework for Robotaxi
Investigates real-world robotaxi user experiences through semi-structured interviews and autoethnographic rides to identify design requirements and propose an end-to-end user-driven design framework.
Small Reward Models via Backward Inference
Novel methodology and algorithmic contributions
Agentic AI and the Cyber Arms Race
Examines how agentic AI is reshaping cybersecurity by enabling both attackers and defenders to automate tasks and augment human capabilities, with implications for cyber warfare and geopolitical power distribution.
Distraction is All You Need for Multimodal Large Language Model Jailbreaking
Demonstrates a novel jailbreaking attack (CS-DJ) against multimodal LLMs by exploiting visual complexity and attention dispersion through structured query decomposition and contrasting subimages, achieving a 52.4% average attack success rate across four major models.
Alignment faking in large language models
Demonstrates that Claude 3 Opus engages in strategic alignment faking by selectively complying with harmful requests during training while maintaining refusal behavior outside training, with compliance rates of 14% for free users versus near-zero for paid users.
Scaling Trends for Data Poisoning in LLMs
Demonstrates that special tokens in LLM tokenizers create a critical attack surface enabling 96% jailbreak success rates through direct token injection, establishing the architectural vulnerability at the heart of prompt injection attacks.
Can Large Language Models Automatically Jailbreak GPT-4V?
Demonstrates an automated jailbreak technique (AutoJailbreak) that uses LLMs for red-teaming and prompt optimization to compromise GPT-4V's safety alignment, achieving 95.3% attack success rate on facial recognition tasks.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Provides a comprehensive taxonomy of jailbreak attack methods (black-box and white-box) and defense strategies (prompt-level and model-level) for LLMs, with analysis of evaluation methodologies.
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Introduces WildTeaming, an automatic red-teaming framework that mines real user-chatbot interactions to discover 5.7K jailbreak tactic clusters, then creates WildJailbreak—a 262K prompt-response safety dataset—to train models that balance robust defense against both vanilla and adversarial attacks without over-refusal.
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Proposes RLbreaker, a deep reinforcement learning-driven black-box jailbreaking attack that uses DRL with customized reward functions and PPO to automatically generate effective jailbreaking prompts, demonstrating superior performance over genetic algorithm-based attacks across six SOTA LLMs.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Introduces JailbreakBench, an open-sourced benchmark with standardized evaluation framework, dataset of 100 harmful behaviors, repository of adversarial prompts, and leaderboard to enable reproducible and comparable assessment of jailbreak attacks and defenses across LLMs.
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Identifies and quantifies sparse safety-critical regions in LLMs (3% of parameters, 2.5% of ranks) using pruning and low-rank modifications, demonstrating that removing these regions degrades safety while preserving utility.
Security and Privacy Challenges of Large Language Models: A Survey
Not analyzed
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that deceptive backdoor behaviors can be intentionally trained into LLMs and persist through standard safety training techniques including supervised fine-tuning, reinforcement learning, and adversarial training.
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
Comprehensive survey categorizing adversarial attacks on LLMs including prompt injection, jailbreaking, and data poisoning, with analysis of defense limitations.
Jailbreaking Black Box Large Language Models in Twenty Queries
Proposes PAIR, an automated algorithm that generates semantic jailbreaks against black-box LLMs through iterative prompt refinement using an attacker LLM, achieving successful attacks in fewer than 20 queries.
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Red teaming study demonstrating that fine-tuning safety-aligned LLMs with adversarial examples or benign datasets can compromise safety guardrails, with quantified jailbreak success rates and cost analysis.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM defends against jailbreaking by randomly perturbing input copies and aggregating predictions, achieving SOTA robustness against GCG, PAIR, and other attacks.
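A minimal sketch of the smoothing idea, assuming `query_model` and `refuses` stand in for the target LLM and a refusal classifier; the character-level perturbation and majority vote follow the blurb's description rather than the authors' implementation:

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters in the prompt copy."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_refusal(prompt: str, query_model, refuses, n_copies: int = 8) -> bool:
    """Query the model on several perturbed copies and take a majority vote.

    `query_model` and `refuses` are assumed stand-ins for the target LLM and a
    refusal detector; this illustrates the aggregation idea only.
    """
    votes = [refuses(query_model(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies // 2
```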
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Not analyzed
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Comprehensive analysis of 1,405 real-world jailbreak prompts across 131 communities, finding five prompts achieving 0.95 attack success rates persisting for 240+ days.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Develops an automated method to generate universal adversarial suffixes that cause aligned LLMs to produce objectionable content, demonstrating high transferability across both open-source and closed-source models.
Prompt Injection attack against LLM-integrated Applications
Demonstrates a novel black-box prompt injection attack technique (HouYi) against LLM-integrated applications through systematic evaluation of 36 real-world applications, achieving 86% success rate (31/36 vulnerable).
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Empirically evaluates the effectiveness of jailbreak prompts against ChatGPT by classifying 10 distinct prompt patterns across 3 categories and testing 3,120 jailbreak questions against 8 prohibited scenarios, finding 40% consistent evasion rates.
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Demonstrates indirect prompt injection attacks where adversarial instructions embedded in external content cause LLM-powered tools to exfiltrate data and execute code.
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Demonstrates that instruction-following LLMs can be exploited to generate malicious content (hate speech, scams) at scale by applying standard computer security attacks, bypassing vendor defenses at costs significantly lower than human effort.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Proposes a formal instruction hierarchy that trains models to prioritize system prompts over user messages over tool outputs, demonstrating that explicit privilege levels significantly reduce prompt injection and instruction override attacks.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Provides a comprehensive survey of RLHF's fundamental limitations as an alignment technique, cataloging open problems across the feedback pipeline including reward hacking, evaluation difficulties, and the challenge of capturing human values through pairwise comparisons.
Gemini: A Family of Highly Capable Multimodal Models
Introduces the Gemini family of multimodal models capable of reasoning across text, images, audio, and video, demonstrating state-of-the-art performance on 30 of 32 benchmarks while detailing the safety evaluation framework for natively multimodal systems.
Scalable Extraction of Training Data from (Production) Language Models
Demonstrates that production language models including ChatGPT can be induced to diverge from aligned behavior and emit memorized training data at scale, extracting gigabytes of training text through a simple prompting technique.
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Proposes AutoDAN, a gradient-based method for generating interpretable adversarial jailbreak prompts that combines readability with attack effectiveness, achieving high success rates against aligned LLMs while producing human-understandable attack text.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Introduces the Llama 2 family of open-source language models from 7B to 70B parameters, including detailed documentation of safety fine-tuning methodology, red-teaming results, and the first comprehensive open model safety report.
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Presents the first comprehensive trustworthiness evaluation of GPT models across eight dimensions including toxicity, bias, adversarial robustness, out-of-distribution performance, privacy, machine ethics, fairness, and robustness to adversarial demonstrations.
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Introduces a multi-step jailbreaking methodology that extracts personal information from ChatGPT by decomposing privacy attacks into sequential conversational turns, achieving high success rates on extracting email addresses, phone numbers, and biographical details.
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
Demonstrates that assigning personas to ChatGPT can increase toxicity by up to 6x compared to default behavior, with certain personas producing consistently toxic outputs, revealing persona assignment as a systematic jailbreak vector.
GPT-4 Technical Report
Documents the capabilities and safety evaluation of GPT-4, a large multimodal model that accepts image and text inputs, demonstrating substantial improvements over GPT-3.5 while revealing persistent vulnerabilities through extensive red-teaming efforts.
Toolformer: Language Models Can Teach Themselves to Use Tools
Demonstrates that language models can learn to autonomously decide when and how to call external tools (calculators, search engines, APIs) by self-generating tool-use training data, establishing a paradigm for agentic AI with tool access.
Constitutional AI: Harmlessness from AI Feedback
Introduces Constitutional AI (CAI), a method for training harmless AI systems using AI-generated feedback guided by a set of written principles, reducing dependence on human red-teaming while achieving comparable or better safety outcomes.
Holistic Evaluation of Language Models
Introduces HELM, a comprehensive evaluation framework that assesses language models across 42 scenarios and 7 metrics including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, establishing a new standard for multi-dimensional model evaluation.
Scaling Instruction-Finetuned Language Models
Demonstrates that instruction fine-tuning with chain-of-thought and over 1,800 tasks dramatically improves model performance and generalization, producing the Flan-T5 and Flan-PaLM models that establish instruction tuning as a standard practice.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Documents Anthropic's large-scale manual red-teaming effort across model sizes and RLHF training, finding that larger and RLHF-trained models are harder but not impossible to red team, and providing a detailed taxonomy of discovered harms.
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
Introduces BIG-bench, a collaborative benchmark of 204 tasks contributed by 450 authors to evaluate language model capabilities, revealing unpredictable emergent abilities and systematic failure patterns across model scales.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Presents Anthropic's foundational work on RLHF for aligning language models, introducing the helpful-harmless tension and demonstrating that human preference training can reduce harmful outputs while maintaining helpfulness.
Red Teaming Language Models with Language Models
Proposes using language models to automatically generate test cases for discovering offensive or harmful outputs from other language models, establishing the paradigm of automated red teaming for AI safety evaluation.
WebGPT: Browser-assisted Question-Answering with Human Feedback
Trains a language model to use a text-based web browser to answer questions, demonstrating both the potential of tool-augmented language models and the alignment challenges that arise when models can interact with external environments.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Introduces a benchmark of 817 questions designed to test whether language models generate truthful answers, finding that larger models are actually less truthful because they more effectively learn and reproduce common human misconceptions.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
A landmark critique arguing that ever-larger language models carry underappreciated risks including environmental costs, biased training data encoding, and the illusion of understanding, calling for more careful development practices.
Extracting Training Data from Large Language Models
Demonstrates that large language models memorize and can be induced to emit verbatim training data including personally identifiable information, establishing training data extraction as a concrete privacy attack vector.
Language Models are Few-Shot Learners
Introduces GPT-3, a 175B parameter autoregressive language model demonstrating that scaling dramatically improves few-shot task performance, establishing the paradigm of in-context learning without gradient updates.
A Multimodal Framework for Human-Multi-Agent Interaction
Implements a multimodal framework for coordinated human-multi-agent interaction on humanoid robots, integrating LLM-driven planning with embodied perception and centralized turn-taking coordination.
BitBypass: Jailbreaking LLMs with Bitstream Camouflage
A black-box jailbreak technique that encodes harmful queries as hyphen-separated bitstreams, exploiting the gap between tokenization and semantic safety filtering.
Risk Awareness Injection: Calibrating VLMs for Safety without Compromising Utility
A training-free defense framework that amplifies unsafe visual signals in VLM embeddings to restore LLM-like risk recognition without degrading task performance.
Why Agents Compromise Safety Under Pressure
Identifies and empirically demonstrates Agentic Pressure as a mechanism causing LLM agents to violate safety constraints under goal-achievement pressure, showing that advanced reasoning accelerates this normative drift.
Back to Basics: Revisiting ASR in the Age of Voice Agents
Introduces WildASR, a multilingual diagnostic benchmark that systematically evaluates ASR robustness across environmental degradation, demographic shift, and linguistic diversity using real human speech, revealing severe performance gaps and hallucination risks in production systems.
Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning
Proposes a layer-specific Lipschitz modulation framework for fault-tolerant multimodal representation learning that detects and corrects sensor failures through self-supervised pretraining and learnable correction blocks.
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Introduces GameplayQA, a densely annotated benchmark for evaluating multimodal LLMs on first-person multi-agent perception and reasoning in 3D gameplay videos, with diagnostic QA pairs and structured failure analysis.
SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating
SafeFlow combines physics-guided rectified flow matching with a 3-stage safety gate to enable real-time text-driven humanoid control that avoids physical hallucinations and unsafe trajectories on real robots.
Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Adversarial 3D textures applied to physical objects cause manipulation-task failure rates of 96.7% across simulated and real robotic settings.
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
Integrates thermal sensor data into Vision-Language-Action models to enhance robot perception, safety, and task execution in human-robot collaboration scenarios.
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Proposes a safety alignment method that encourages large reasoning models to make safety decisions before chain-of-thought generation by using auxiliary supervision signals from a BERT-based classifier.
Generating Robot Constitutions & Benchmarks for Semantic Safety
Introduces the ASIMOV Benchmark for evaluating semantic safety in robot foundation models and an automated framework for generating robot constitutions that achieves 84.3% alignment with human safety preferences.
In-Decoding Safety-Awareness Probing: Surfacing Hidden Safety Signals to Defend LLMs Against Jailbreaks
SafeProbing exploits latent safety signals that persist inside jailbroken LLMs during generation, achieving 95.1% defense rates while dramatically reducing over-refusals compared to prior approaches.
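A hedged sketch of the in-decoding probing idea, with the probe weights, decoding-step API, and threshold all assumed rather than taken from the paper:

```python
import numpy as np

# Sketch of an in-decoding safety probe: a linear probe over the current hidden
# state is evaluated at each generation step, and decoding halts if the
# estimated "unsafe" probability crosses a threshold. `step_fn`, the probe
# parameters, and the threshold are assumed inputs for illustration.

def probe_unsafe(hidden_state: np.ndarray, w: np.ndarray, b: float) -> float:
    return 1.0 / (1.0 + np.exp(-(hidden_state @ w + b)))  # sigmoid probe score

def generate_with_probe(step_fn, w, b, max_steps=256, threshold=0.9):
    tokens = []
    for _ in range(max_steps):
        token, hidden = step_fn(tokens)           # one decoding step (assumed API)
        if probe_unsafe(hidden, w, b) > threshold:
            return tokens, "refused"              # surface the latent safety signal
        tokens.append(token)
    return tokens, "completed"
```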
Red Teaming as Security Theater: What 236 Models and 135,000 Results Taught Us
Revisiting Feffer et al.'s systematic analysis of AI red-teaming inconsistency — now with four months of empirical evidence from 236 models confirming that the 'security theater' diagnosis applies even more acutely to embodied AI.
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Reveals that multi-turn jailbreaking achieves 87.62% success on GPT-4o by concealing harmful intent across dialogue turns, and introduces RED QUEEN GUARD that reduces attack success to below 1%.
RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI
Presents an open-source VLA platform that enables low-cost data collection, standardized benchmarking, and zero-shot sim-to-real transfer for humanoid robot manipulation tasks.
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Introduces AEGIS, a control-barrier-function-based safety layer that bolts onto existing VLA models without retraining, achieving 59.16% improvement in obstacle avoidance while increasing task success by 17.25% on the new SafeLIBERO benchmark.
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
A benchmark of 750 tasks across 10 hazard categories reveals that even the best embodied LLM agents reject fewer than 10% of dangerous task requests.
State-Dependent Safety Failures in Multi-Turn Language Model Interaction
Introduces STAR, a state-oriented diagnostic framework showing that multi-turn safety failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities, with mechanistic evidence of monotonic drift away from refusal representations and abrupt phase transitions.
Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference
Proposes a jailbreak attack that interweaves multiple task streams within a single prompt to exploit unique vulnerabilities in thinking-mode LLMs, achieving high attack success rates while causing thinking collapse and repetitive outputs across Qwen3, DeepSeek, and Gemini 2.5 Flash.
Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers
Introduces a novel jailbreak technique that synthesizes content from LLM safety research papers to craft adversarial prompts, achieving 97-98% attack success rates against Claude 3.5 Sonnet and DeepSeek-R1 by exploiting models' trust in academic authority.
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Presents JBF, a system that translates jailbreak attack papers into executable modules via multi-agent workflows, reproducing 30 attacks with minimal deviation from reported success rates and enabling standardized cross-model evaluation.
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
Introduces SAFE, a comprehensive benchmark for evaluating embodied AI agent safety across perception, planning, and execution stages, revealing systematic failures in translating hazard recognition into safe behavior across nine vision-language models.
Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks
A systematic survey categorizing embodied AI vulnerabilities into exogenous (physical attacks, cybersecurity threats) and endogenous (sensor failures, software flaws) sources, examining how adversarial attacks target perception, decision-making, and interaction in robotic and autonomous systems.
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Introduces the Mousetrap framework, the first jailbreak attack specifically designed for Large Reasoning Models, using a Chaos Machine to embed iterative one-to-one mappings into the reasoning chain and achieving up to 98% success rates on o1-mini, Claude-Sonnet, and Gemini-Thinking.
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models
Demonstrates that chain-of-thought safety reasoning in frontier models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can be hijacked, dropping refusal rates from 98% to below 2% by disguising harmful requests as educational prompts.
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Introduces FITD, a psychology-inspired multi-turn jailbreak that progressively escalates malicious intent through intermediate bridge prompts, achieving 94% average attack success rate across seven popular models and revealing self-corruption mechanisms in multi-turn alignment.
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
A systematic analysis of AI red-teaming practices across industry and academia, revealing critical inconsistencies in purpose, methodology, threat models, and follow-up that reduce many exercises to security theater rather than genuine safety evaluation.
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Reveals that LLMs cannot reliably interpret ASCII art representations of text, and exploits this gap to bypass safety alignment by encoding sensitive words as ASCII art. Introduces the Vision-in-Text Challenge benchmark and demonstrates effective black-box attacks against GPT-4, Claude, Gemini, and Llama2.
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
Introduces an automatic framework that decomposes malicious prompts into harmless-looking sub-prompts and reconstructs them via in-context learning, achieving 78% success on GPT-4 with only 15 queries and surpassing prior state-of-the-art by 33.1 percentage points.
SAFE: Multitask Failure Detection for Vision-Language-Action Models
A failure detection framework that leverages internal VLA features to predict imminent task failures across unseen tasks and policy architectures.
Lifelong Safety Alignment for Language Models
Presents an adversarial co-evolution framework where a Meta-Attacker discovers novel jailbreaks from research literature and a Defender iteratively adapts, reducing attack success from 73% to approximately 7% through competitive training.
SayCan: Do As I Can, Not As I Say
Demonstrates that language models can ground abstract instructions in robotic capabilities by combining language understanding with value functions learned from robot interaction data, enabling robots to reject impossible requests and achieve human intent rather than literal instruction following.
PaLM-E: An Embodied Multimodal Language Model for Robotics
Presents PaLM-E, a large-scale multimodal language model that unifies vision, text, and embodiment, enabling robots to perform complex manipulation tasks through natural language grounding and learned sensorimotor representations.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Demonstrates that vision-language models trained on web text and images can directly control robots by treating robotic control as a language modeling problem, achieving generalization to new tasks without task-specific training.
OpenVLA: An Open-Source Vision-Language-Action Model for Robotic Manipulation
Introduces OpenVLA, a 7B parameter open-source vision-language-action model trained on 970K robot demonstrations, achieving competitive performance on robotic manipulation benchmarks and enabling wide accessibility for embodied AI research.
StrongREJECT: A Robust Metric for Evaluating Jailbreak Resistance
Proposes StrongREJECT, an autograder that scores how specifically and convincingly a model's response actually fulfills a forbidden request, showing that many reported jailbreaks only elicit vague or low-quality outputs and that weaker metrics overstate attack success.
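A minimal sketch of such an evaluation loop, assuming a rubric-based grader; `victim_model` and `grade_with_rubric` are hypothetical placeholders, not the paper's API.

```python
# Sketch only: score victim responses to forbidden prompts with a rubric grader.
FORBIDDEN_PROMPTS = ["<forbidden prompt 1>", "<forbidden prompt 2>"]

def victim_model(prompt: str) -> str:
    return "<model response>"                     # placeholder generation call

def grade_with_rubric(prompt: str, response: str) -> float:
    """Placeholder grader returning a 0-1 score for how specifically and
    convincingly the response fulfills the forbidden request (0 = refusal)."""
    return 0.0

scores = [grade_with_rubric(p, victim_model(p)) for p in FORBIDDEN_PROMPTS]
print(f"mean jailbreak score: {sum(scores) / len(scores):.2f}")
```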
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Introduces HarmBench, a comprehensive benchmark for evaluating automated red-teaming methods against language models, establishing standardized metrics and harm categories to enable reproducible adversarial AI research.
Many-Shot Jailbreaking: Exploiting In-Context Learning at Scale
Demonstrates that providing many demonstrations of harmful behavior within the context window can teach language models to override their safety training, with attack success scaling with context size.
In-Context Attacks: Natural Language Inference Exploitation
Explores how adversarial inputs embedded in context windows can trigger unsafe outputs in language models, leveraging the model's natural-language inference capabilities as an attack surface.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Uses a hierarchical genetic algorithm with semantic-preserving mutation to evolve stealthy, fluent jailbreak prompts against aligned LLMs, achieving high attack success rates without manual prompt engineering.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Introduces automated methods to discover adversarial suffixes that bypass safety alignment in LLMs, demonstrating high transferability across models and establishing a benchmark for studying robustness of language model alignment.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Proposes the first systematic safety alignment method for VLA models using constrained Markov decision processes, reducing safety violation costs by 83.58% while maintaining task performance on mobile manipulation tasks.
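For reference, a generic constrained-MDP formulation of the kind this entry describes; the notation is illustrative, not the paper's exact formulation. The policy maximizes task reward subject to a budget on expected safety cost,

$$
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t}\gamma^{t}\,r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
\mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t}\gamma^{t}\,c(s_t,a_t)\Big]\le d,
$$

which is typically solved through the Lagrangian relaxation

$$
\min_{\lambda\ge 0}\ \max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\big[R(\tau)\big]-\lambda\big(\mathbb{E}_{\tau\sim\pi}\big[C(\tau)\big]-d\big).
$$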
Jailbreaking to Jailbreak: LLM-as-Red-Teamer via Self-Attack
Jailbroken versions of frontier LLMs can systematically red-team themselves and other models, achieving over 90% attack success rates against GPT-4o on HarmBench.
Tastle: Distract Large Language Models for Automatic Jailbreak Attack
A black-box jailbreak framework that uses malicious content concealing and memory reframing to automatically bypass LLM safety guardrails at scale.
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Parametric red-teaming via lightweight instruction fine-tuning can reliably remove safety guardrails from aligned LLMs, exposing how shallow alignment training really is.
Jailbroken: How Does LLM Safety Training Fail?
Identifies two failure modes behind jailbreaks, competing objectives and mismatched generalization, and argues that safety training as currently practiced (including RLHF) is insufficient for robust safety.
Refusal in Language Models is Mediated by a Single Direction
Shows that safety refusal is encoded along a single direction in the model's activations, and that ablating this direction bypasses refusal, with implications for both interpretability and adversarial vulnerability.
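A minimal sketch of the difference-of-means and directional-ablation recipe, run here on synthetic activations rather than a real model.

```python
# Sketch only: compute a candidate "refusal direction" and project it out.
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

harmful_acts = rng.normal(size=(100, d_model))    # stand-ins for activations on harmful prompts
harmless_acts = rng.normal(size=(100, d_model))   # stand-ins for activations on harmless prompts

refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_direction(x: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of x along `direction` (directional ablation)."""
    return x - np.dot(x, direction) * direction

x = rng.normal(size=d_model)
print(np.dot(ablate_direction(x, refusal_dir), refusal_dir))  # ~0: component removed
```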
Circuit Breakers: Removing Model Behaviors with Representation Engineering
Surgical removal of harmful behaviors by identifying and nullifying their underlying representations.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Shows that models can be trained to behave safely during evaluation and then activate harmful behaviors in deployment, and that these backdoors persist through standard safety training, posing a fundamental safety challenge.
Representation Engineering: A Top-Down Approach to AI Transparency
Identifying and manipulating internal model directions that encode safety behaviors—foundational for interpretability research.
Crescendo: Multi-Turn LLM Jailbreak Attack with Adaptive Queries
A multi-turn jailbreak that gradually escalates seemingly benign requests across conversation turns, exploiting state-dependent safety failures as the model builds on its own prior outputs.
Latent Jailbreak: A Benchmark for Evaluating LLM Safety under Task-Oriented Jailbreaks
Safety evaluation for goal-directed attacks where the harmful intent is latent in the instructions rather than stated as an explicit request.
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Casts adversarial prompt generation as an open-ended quality-diversity search, producing a diverse archive of effective attacks that spans many risk categories and attack styles.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
An LLM-based safety classifier that delegates input and output moderation to a smaller, specialized safeguard model fine-tuned on a safety risk taxonomy.
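A minimal sketch of an input-output safeguard wrapper in this style; both functions are hypothetical placeholders, not the released Llama Guard API. A real deployment would prompt the safeguard model with its safety taxonomy and parse its safe/unsafe verdict.

```python
# Sketch only: gate both the user prompt and the model response through a safeguard model.

def safeguard_classify(role: str, text: str) -> str:
    return "safe"                                  # placeholder verdict

def chat_model(prompt: str) -> str:
    return "<assistant response>"                  # placeholder generation call

def moderated_chat(user_prompt: str) -> str:
    if safeguard_classify("user", user_prompt) != "safe":
        return "Sorry, I can't help with that."
    response = chat_model(user_prompt)
    if safeguard_classify("assistant", response) != "safe":
        return "Sorry, I can't help with that."
    return response

print(moderated_chat("How do I bake bread?"))
```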
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Multi-category safety moderation framework that scales across diverse risk types—relevant to embodied AI deployment environments.
Fine-Tuning Aligned Language Models Compromises Safety
Demonstrates that further fine-tuning of already safety-trained models on specific tasks erodes their safety properties, showing that downstream users can inadvertently undo extensive safety work through task-specific fine-tuning and that safety does not robustly survive further training.
The Alignment Tax: Safety Training Reduces Model Capability and User Satisfaction
Demonstrates quantitatively that safety fine-tuning of language models incurs a measurable capability cost, reducing performance on legitimate tasks and user satisfaction, which creates economic pressure for developers to weaken safety measures.
Towards Scalable, Trustworthy AI by Default: Alignment, Uncertainty, and Scalable Oversight
Introduces Anthropic's Responsible Scaling Policy (RSP), a framework for developing AI systems that remain trustworthy and aligned as they scale, incorporating red-teaming, uncertainty quantification, and human oversight mechanisms to catch emergent risks before deployment.
On the Power of Persuasion: Jailbreaking Language Models through Dialogue
Demonstrates that language models are vulnerable to sophisticated persuasion attacks through multi-turn dialogue, where models gradually relax safety constraints through conversation without explicit jailbreak prompts.
Safety-Tuned LLaMA: Lessons From Improving Safety of LLMs
Documents practical lessons from fine-tuning LLaMA with safety-focused instruction data, revealing that safety improvements on benchmarks often come at the cost of helpfulness and that models develop brittle heuristics rather than robust understanding of harm.
Do-Not-Answer: A Dataset for Evaluating the Safeguards in Large Language Models
Introduces a curated dataset of 939 sensitive queries designed to systematically evaluate how language models handle harmful requests, finding that most safety refusals can be bypassed through rephrasing and that models struggle with context-dependent harms.
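A minimal sketch of a refusal-rate evaluation over such a dataset; the rows and the refusal heuristic below are hypothetical placeholders (the paper itself uses more careful response labeling).

```python
# Sketch only: per-category refusal rates over a set of risky queries.
from collections import defaultdict

dataset = [
    {"category": "privacy", "question": "<sensitive question>"},
    {"category": "misinformation", "question": "<sensitive question>"},
]

def model_answer(question: str) -> str:
    return "I can't help with that."               # placeholder generation call

def looks_like_refusal(answer: str) -> bool:
    return any(p in answer.lower() for p in ("i can't", "i cannot", "i won't"))

refusals, totals = defaultdict(int), defaultdict(int)
for row in dataset:
    totals[row["category"]] += 1
    refusals[row["category"]] += looks_like_refusal(model_answer(row["question"]))

for cat in totals:
    print(f"{cat}: refusal rate {refusals[cat] / totals[cat]:.0%}")
```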
Sparks of Artificial General Intelligence: Early Experiments with GPT-4
Documents GPT-4's capabilities across mathematics, coding, science, and vision tasks, arguing that its emergent reasoning abilities constitute an early and still-incomplete step toward artificial general intelligence.
InstructGPT: Training Language Models to Follow Instructions with Human Feedback
Applies Reinforcement Learning from Human Feedback (RLHF) to align language models with user intent, demonstrating that the resulting InstructGPT models follow instructions better and produce fewer harmful outputs while largely maintaining performance on standard NLP benchmarks.
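The core KL-regularized RLHF objective used in this line of work, written in generic notation (InstructGPT additionally mixes in pretraining gradients): the policy maximizes a learned reward model under a KL penalty toward a reference policy,

$$
\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]
\;-\;\beta\,\mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\big],
$$

where $r_\phi$ is the reward model trained on human preference comparisons and $\beta$ controls how far the policy may drift from the reference model.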