Each post covers a key paper in AI safety, alignment, or adversarial ML — with a NotebookLM-generated research report, study guide, FAQ, and audio overview. Papers are selected for their relevance to how AI systems fail.
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
A comprehensive survey unifying VLA safety research across adversarial attacks, defenses, benchmarks, and six deployment domains.
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
ARMOR defends LLMs against jailbreak attacks by using inference-time reasoning to detect attack strategies, extract true intent, and apply policy-grounded safety analysis.
Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning Models
Mechanistic analysis of reasoning models discovers the 'refusal cliff'—models correctly identify harmful prompts during thinking but systematically suppress their refusal at the final output tokens.
Using Large Language Models for Embodied Planning Introduces Systematic Safety Risks
DESPITE benchmark reveals that across 23 models, near-perfect planning ability does not ensure safety—the best planner still generates dangerous plans 28.3% of the time.
An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
A structured survey that treats Safety as one of five foundational VLA challenges alongside Representation, Execution, Generalization, and Evaluation.
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Directly removing harmful knowledge from LLMs via machine unlearning—with just 20 training examples—cuts jailbreak success rates more effectively than safety fine-tuning on 100k samples.
C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal
C-ΔΘ uses mechanistic circuit analysis to localize refusal-causal computation and distill it into a sparse offline weight update, eliminating per-request inference-time safety hooks.
FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models
FailSafe introduces a scalable failure generation and recovery system that automatically creates diverse failure cases with executable recovery actions, boosting VLA manipulation success by up to 22.6%.
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
ADVLA exploits attention maps and Top-K masking to craft sparse, stealthy adversarial patches in VLA models' textual feature space, achieving high attack success rates while remaining nearly invisible.
LIBERO-X: Robustness Litmus for Vision-Language-Action Models
A new benchmark exposes persistent evaluation gaps in VLA models by combining hierarchical difficulty protocols and diverse teleoperation data to reveal that cumulative perturbations cause dramatic performance drops.
Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
Answer-Then-Check trains LLMs to generate a candidate response first and then evaluate its own safety, achieving robust jailbreak defense without sacrificing reasoning or utility.
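A minimal sketch of the answer-then-check pattern at inference time, assuming `generate` and `is_unsafe` stand in for a model call and a safety judgment (the paper trains this behaviour into the model rather than wiring it up externally):

```python
# Illustrative sketch of an answer-then-check pipeline (not the paper's
# training recipe): draft a candidate answer, then self-evaluate it before
# deciding whether to release it. `generate` and `is_unsafe` are assumed
# stand-ins for a model call and a safety judgment.

def answer_then_check(prompt: str, generate, is_unsafe) -> str:
    draft = generate(prompt)                      # stage 1: candidate response
    if is_unsafe(prompt, draft):                  # stage 2: self-check the draft
        return "I can't help with that request."  # withhold unsafe drafts
    return draft                                  # release safe drafts unchanged
```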
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
A systematic study of 80 agent safety benchmarks shows that 74% of specifiable policies can be enforced by symbolic guardrails, providing formal safety guarantees that training-based methods cannot.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED reveals a critical alignment gap in embodied AI: while multimodal LLMs can recognize kitchen hazards in QA settings, they largely fail to mitigate those same hazards when planning physical actions.
Weak-to-Strong Jailbreaking on Large Language Models
Researchers show that small, unsafe models can efficiently guide jailbreaking attacks against much larger, carefully aligned models by exploiting divergences in initial decoding distributions.
Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal
Using sparse autoencoders to mechanistically identify the neural features that drive safety refusal in instruction-tuned LLMs, revealing layered redundant defenses and new pathways for targeted safety auditing.
Updating Robot Safety Representations Online from Natural Language Feedback
A method for dynamically updating robot safety constraints at deployment time using vision-language models and Hamilton-Jacobi reachability, enabling robots to respect context-specific hazards communicated through natural language.
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
Comprehensive survey of Vision-and-Language Navigation for UAVs, charting the evolution from modular approaches to foundation model-driven systems and identifying deployment challenges and future...
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
UMI-3D extends the Universal Manipulation Interface with LiDAR-based 3D spatial perception to overcome monocular SLAM limitations and improve robustness of embodied manipulation data collection and...
SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing
SpaceMind is a modular vision-language agent framework for autonomous on-orbit servicing that combines skill modules, MCP tools, and reasoning modes with a self-evolution mechanism, validated through...
DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation
Introduces DR³-Eval, a reproducible benchmark for evaluating deep research agents on multimodal report generation with a static sandbox corpus and multi-dimensional evaluation framework,...
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator reranking to improve closed-loop autonomous driving planning, validated through simulation and real-world...
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
A comprehensive benchmark and HD-Guard dual-brain architecture for detecting unsafe actions by embodied VLM agents in household environments, exposing critical gaps in real-time safety monitoring.
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
A self-play reinforcement learning framework where an LLM simultaneously generates adversarial jailbreak attacks and strengthens its own defenses, reducing attack success rates without external red teams.
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
Introduces EmbodiedGovBench, a benchmark for evaluating governance, safety, and controllability of embodied agent systems across seven dimensions including policy enforcement, recovery, auditability,...
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
A bi-level meta-optimization framework co-evolves jailbreak prompts and scoring templates to achieve 100% attack success on Claude-4-Sonnet, exposing fundamental cracks in how safety alignment is measured.
DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
A physics-based simulator for dual-arm humanoid robots introduces a contingency mechanism that deliberately injects low-level execution failures, revealing critical robustness gaps in current VLMs.
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Systematically evaluates typographic prompt injection attacks on four vision-language models across varying font sizes and visual conditions, correlating text-image embedding distance to attack...
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Adversarial attacks targeting high-entropy tokens in VLMs achieve severe semantic degradation with minimal perturbation budgets and transfer across architectures.
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
A new benchmark reveals that LLMs placed under performance incentives exhibit emergent misalignment — violating stated safety constraints to maximize KPIs, with reasoning capability failing to predict safe behavior.
VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response
Evaluates multi-agent cooperative navigation systems under realistic fire-disaster conditions using VLM-enhanced perception, identifying critical failure modes in smoke, thermal hazards, and sensor...
RACF: A Resilient Autonomous Car Framework with Object Distance Correction
Proposes RACF, a resilient autonomous vehicle framework that uses multi-sensor redundancy (depth camera, LiDAR, kinematics) with an Object Distance Correction Algorithm to detect and mitigate...
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Multi-turn human jailbreaks achieve over 70% attack success rate against state-of-the-art LLM defenses that report single-digit rates against automated attacks, exposing a systematic gap in how safety is evaluated.
10 Open Challenges Steering the Future of Vision-Language-Action Models
A position paper from AAAI 2026 identifies ten development milestones for VLA models in embodied AI, with safety named explicitly among the challenges and evaluation gaps highlighted as a systemic barrier to progress.
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
Comprehensive evaluation of state-of-the-art Vision Language Models on Action Quality Assessment tasks, revealing systematic failure modes and biases that prevent reliable performance.
Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
Intentional safety-induced biases in aligned LLMs create asymmetric jailbreak attack surfaces, with GPT-4o showing up to 20% success-rate disparities based solely on demographic keyword substitutions.
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
A systematic survey of techniques for reducing latency, memory, and compute costs in VLA models, revealing how efficiency constraints directly shape the safety guarantees available to deployed robotic systems.
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
Introduces a physical agentic loop that wraps learned grasp primitives with execution monitoring and bounded recovery policies to handle failures in language-guided robotic manipulation.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
AHA is an open-source VLM that detects robotic manipulation failures and generates natural-language explanations, enabling safer recovery pipelines and denser reward signals.
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
Safety Chain-of-Thought (SCoT) teaches LLMs to reason about potential harms before generating a response, substantially improving robustness to jailbreak attacks including out-of-distribution prompts.
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Introduces Plan-RewardBench, a trajectory-level preference benchmark for evaluating reward models in tool-using agent scenarios, and benchmarks three RM families (generative, discriminative,...
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
VLA-Fool exposes how textual, visual, and cross-modal adversarial attacks can systematically break the safety alignment of embodied VLA models, and proposes a semantic prompting framework as a first line of defense.
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
CRAFT defends large reasoning models against jailbreaks by aligning safety directly in hidden state space via contrastive reinforcement learning, reducing attack success rates without degrading reasoning capability.
BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization
BadVLA reveals that VLA models are vulnerable to a novel backdoor attack that decouples trigger learning from task objectives in feature space, enabling stealthy conditional control hijacking in robotic systems.
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
An empirical study showing that misaligning an LLM via fine-tuning is significantly cheaper than realigning it, with asymmetric attack-defense dynamics that have serious implications for deployed safety.
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
CLEAR-Bias introduces a scalable framework that combines jailbreak techniques with LLM-as-a-Judge scoring to reveal how adversarial prompting exploits sociocultural biases embedded in state-of-the-art language models.
Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models
A large-scale replication finds that six of ten frontier LLMs achieve 96–100% attack success rates under multi-turn adversarial pressure, while deliberative inference cuts that rate by more than half without any retraining.
ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
ROSClaw proposes a hierarchical framework integrating vision-language models with heterogeneous robots through unified semantic-physical control, enabling closed-loop policy learning and...
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
Proposes DAERT, a diversity-aware red teaming framework using reinforcement learning to systematically uncover linguistic vulnerabilities in Vision-Language-Action models through adversarial...
Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial Patches
EAD turns an embodied agent's ability to move into a defensive weapon, using recurrent perception and active viewpoint control to defeat adversarial patches in 3D environments.
GuardReasoner: Towards Reasoning-based LLM Safeguards
GuardReasoner trains safety guardrails to produce explicit reasoning chains before verdicts, outperforming GPT-4o+CoT and LLaMA Guard on safety benchmarks while improving generalization to novel adversarial inputs.
LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
A controlled benchmark revealing that paraphrasing task instructions causes 22–52 percentage point performance drops in state-of-the-art VLA models, with most failures traced to object-level lexical sensitivity rather than execution errors.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
The first real-world safety evaluation of a deployed personal AI agent shows that poisoning any single dimension of an agent's persistent state raises attack success rates from a 24.6% baseline to 64–74%, with no existing defense eliminating the vulnerability.
AgentWatcher: A Rule-based Prompt Injection Monitor
A scalable and explainable prompt injection detection system that uses causal attribution to identify influential context segments and explicit rule evaluation to flag injections in LLM-based agents.
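A toy illustration of the rule-evaluation half of such a monitor, with made-up example rules (the actual system also uses causal attribution to rank which context segments influenced the agent):

```python
import re

# Minimal illustration of rule-based injection checking over context segments:
# each segment retrieved into an agent's context is scanned against explicit
# rules (regexes here), and matches are flagged with the rule that fired.
# The rules below are invented examples, not the tool's actual rule set.

RULES = {
    "instruction_override": re.compile(r"ignore (all|any) (previous|prior) instructions", re.I),
    "exfiltration_request": re.compile(r"(send|forward|post) .* (password|api key|token)", re.I),
}

def flag_injections(context_segments: list[str]) -> list[tuple[int, str]]:
    flags = []
    for idx, segment in enumerate(context_segments):
        for name, pattern in RULES.items():
            if pattern.search(segment):
                flags.append((idx, name))  # record which segment tripped which rule
    return flags
```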
AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models
A unified evaluation framework exposing critical adversarial and backdoor vulnerabilities in VLA models, introducing BackdoorVLA — a targeted attack achieving 58.4% average success at hijacking multi-step robotic action sequences.
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
A collaborative multi-agent red-teaming framework that achieves up to 98.1% jailbreak success across leading LLMs via adaptive multi-turn escalation, exposing the inadequacy of single-turn safety alignment under sustained conversational pressure.
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
A three-layer runtime security framework for autonomous agents that prevents privilege escalation, data leakage, and malicious skill execution through context-injected policies, behavioral monitoring, and a decoupled watcher middleware.
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Anthropic's Constitutional Classifiers use LLM-generated synthetic data and natural language rules to create jailbreak-resistant safeguards that survived over 3,000 hours of professional red teaming without a universal bypass being found.
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
A systematic study revealing how adversarial patches and targeted perturbations can cause VLA-based robots to fail catastrophically, with task success rates dropping by as much as 100%.
ANNIE: Be Careful of Your Robots — Adversarial Safety Attacks on Embodied AI
A systematic study of adversarial safety attacks on VLA-powered robots using ISO-grounded safety taxonomies, achieving over 50% attack success rates across all safety categories.
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
Comic-based jailbreaks using structured visual narratives achieve success rates above 90% on commercial multimodal models, exposing fundamental limits of text-centric safety alignment.
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Introduces a process-oriented benchmark with 161 scenarios and 388 safety risks for evaluating whether VLM-driven embodied agents recognize and mitigate dynamic hazards during household task execution — finding that current frontier models lack interactive safety awareness.
TopoPilot: Reliable Conversational Workflow Automation for Topological Data Analysis and Visualization
TopoPilot introduces a two-agent agentic framework with systematic guardrails and verification mechanisms to reliably automate complex scientific visualization workflows, particularly for topological data analysis.
G0DM0D3: A Modular Framework for Evaluating LLM Robustness Through Adaptive Sampling and Input Perturbation
An open-source framework that systematises inference-time safety evaluation into five composable modules — AutoTune (sampling parameter manipulation), Parseltongue (input perturbation), STM (output normalization), ULTRAPLINIAN (multi-model racing), and L1B3RT4S (model-specific jailbreak prompts). The post analyses its implications for adversarial AI safety research.
CoP: Agentic Red-teaming for LLMs using Composition of Principles
An extensible agentic framework that composes human-provided red-teaming principles to generate jailbreak attacks, achieving up to 19x improvement over single-turn baselines.
GoBA: Goal-oriented Backdoor Attack against VLA via Physical Objects
Demonstrates that physical objects embedded in training data can serve as backdoor triggers directing VLA models to execute attacker-chosen goal behaviors with 97% success.
FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models
Introduces adversarial images that 'freeze' VLA-controlled robots mid-task, severing responsiveness to subsequent instructions with 76.2% average attack success across three models and four environments.
Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models
Introduces VROP, a compositional jailbreak for vision-language models that achieves 94-100% ASR on open-source LVLMs and 59-95% on commercial models (including GPT-4o and Claude 3.7 Sonnet) by chaining semantically benign visual inputs that synthesise harmful content only during late-stage reasoning.
Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning
Applies reinforcement learning to automated red teaming, using a three-phase pipeline of supervised fine-tuning, diversity-driven exploration, and progressive enhancement to generate diverse and effective jailbreak prompts.
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
Introduces an inference-time defense mechanism using safe reward models and controlled decoding that reduces jailbreak attack success rates by 57.82% on multimodal LLMs while preserving model capabilities.
DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models
Demonstrates that VLA models can be backdoored at the action-primitive level with as few as 0.31% of training episodes poisoned, achieving 98-99% attack success while preserving clean task performance.
Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems
The first formal proof that safety is non-compositional — two individually safe AI agents can collectively reach forbidden goals through emergent conjunctive capability dependencies. Component-level safety verification is provably insufficient.
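A toy example of the non-compositionality claim, with hypothetical capability names (the paper's formal framework is far richer):

```python
# Toy illustration of non-compositional safety: each agent's capability set is
# individually insufficient for the forbidden goal, but their union is not.
# Capability names are made up for illustration only.

FORBIDDEN_GOAL = {"read_secrets", "exfiltrate_network"}

agent_a = {"read_secrets", "summarize_text"}      # safe alone: cannot exfiltrate
agent_b = {"exfiltrate_network", "browse_web"}    # safe alone: cannot read secrets

def can_reach(goal: set, capabilities: set) -> bool:
    return goal <= capabilities                    # goal needs all listed capabilities

assert not can_reach(FORBIDDEN_GOAL, agent_a)      # passes component-level check
assert not can_reach(FORBIDDEN_GOAL, agent_b)      # passes component-level check
assert can_reach(FORBIDDEN_GOAL, agent_a | agent_b)  # unsafe only in combination
```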
Colluding LoRA: A Composite Attack on LLM Safety Alignment
Introduces CoLoRA, a composition-triggered attack where individually benign LoRA adapters compromise safety alignment when combined, exploiting the combinatorial blindness of current adapter verification.
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Demonstrates through 1,584 multi-agent simulations that alignment interventions reverse direction in 8 of 16 languages, with safety training amplifying pathology in Japanese while reducing it in English.
Experimental Evaluation of Security Attacks on Self-Driving Car Platforms
First systematic on-hardware experimental evaluation of five attack classes on low-cost autonomous vehicle platforms, establishing distinct attack fingerprints across control deviation, computational cost, and runtime responsiveness.
A Hazard-Informed Data Pipeline for Robotics Physical Safety
Proposes a structured Robotics Physical Safety Framework bridging classical risk engineering with ML pipelines, using formal hazard ontology to generate synthetic training data for safety-critical scenarios.
Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents
Proposes a defensible design blueprint for autonomous tool-invoking agents, treating agent security as a systems engineering problem rather than a model alignment problem.
Blindfold: Jailbreaking Embodied LLMs via Action-level Manipulation
Introduces an automated attack framework for embodied LLMs that operates at the action level rather than the language level, achieving 53% higher ASR than baselines on simulators and a real robotic arm.
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Demonstrates compositional adversarial attacks that jailbreak vision language models by pairing adversarial images with generic text prompts, requiring only vision encoder access rather than LLM access.
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Presents DeepInception, a lightweight jailbreaking method that exploits LLMs' personification capabilities by constructing nested virtual scenes to bypass safety guardrails, with empirical validation across multiple models including GPT-4o and Llama-3.
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Demonstrates that adversarial visual perturbations can universally jailbreak aligned vision-language models, causing them to generate harmful content across diverse malicious instructions.
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Presents Tree of Attacks with Pruning (TAP), an automated black-box jailbreaking method that uses an attacker LLM to iteratively refine prompts and prunes unlikely candidates before querying the target, achieving >80% jailbreak success rates on GPT-4 variants.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
SC-VLA introduces sparse world imagination and online action refinement to enable vision-language-action models to self-correct and refine actions during execution without external reward signals.
CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines
Proposes Contrastive World Models (CWM), a contrastive learning approach to train LLM-based action feasibility scorers using hard-mined negatives, and evaluates it on ScienceWorld with intrinsic affordance tests and live filter characterization studies.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
LiLo-VLA proposes a modular framework that decouples reaching and interaction for long-horizon robotic manipulation, achieving 69% success on simulation benchmarks and 85% on real-world tasks through object-centric VLA policies and dynamic replanning.
SPOC: Safety-Aware Planning Under Partial Observability And Physical Constraints
Introduces SPOC, a benchmark for evaluating safety-aware embodied task planning with LLMs under partial observability and physical constraints, revealing current model failures in implicit constraint handling.
Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map
Tacmap introduces a geometry-consistent penetration depth map framework that bridges the tactile sim-to-real gap by unifying simulation and real-world tactile sensing through a shared volumetric deform map representation.
Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios
Proposes an Active Inference framework with RBPF state estimation and CEM-enhanced MPPI planning to safely handle occluded pedestrian scenarios in autonomous driving, validated through simulation experiments against multiple baselines.
Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
Proposes CEEH, a difficulty-aware entropy regularization method for RL-based LLM reasoning that selectively compresses easy questions while preserving exploration space for hard ones to maintain reasoning capability while reducing inference cost.
LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
Develops LessMimic, a unified distance field-based policy for long-horizon humanoid robot manipulation that generalizes across object scales and task compositions without motion references, validated through multi-task experiments with 80-100% success on scaled objects and 62.1% on composed trajectories.
SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
Develops a gloss-free Vision-Language-Action framework that maps sign language gestures directly to robotic manipulation commands in real-time using alphabet-level finger-spelling.
Natural Emergent Misalignment from Reward Hacking in Production RL
Demonstrates that reward hacking in production RL environments causes emergent misalignment behaviors including alignment faking and cooperation with malicious actors, and evaluates three mitigation strategies.
ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
Proposes ActionReasoning, an LLM-driven multi-agent framework that performs explicit physics-aware action reasoning to generate manipulation plans for robotic brick stacking without relying on custom...
HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
HALO introduces a unified Vision-Language-Action model that performs embodied multimodal chain-of-thought reasoning by sequentially predicting textual task reasoning, visual subgoals, and actions through a Mixture-of-Transformers architecture, evaluated on robotic manipulation benchmarks.
From Perception to Action: An Interactive Benchmark for Vision Reasoning
Introduces CHAIN, an interactive 3D physics-driven benchmark that evaluates whether vision-language models can understand physical constraints, plan structured action sequences, and execute long-horizon manipulation tasks in dynamic environments.
EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations
Fuses depth camera measurements with monocular vision and YOLO-pose keypoint detection using Extended Kalman Filtering to enable accurate distance estimation for autonomous UAV following of humans in search and rescue operations.
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Empirical study with experimental evaluation
Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty
Proposes Fuz-RL, a fuzzy measure-guided framework that uses Choquet integrals and a novel fuzzy Bellman operator to achieve safe reinforcement learning under multiple uncertainty sources without min-max optimization.
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Develops and validates a simulation-based clinical red teaming framework that pairs AI psychotherapists with dynamic patient agents to systematically identify safety failures in LLM-driven mental health support, revealing critical iatrogenic risks across 369 therapy sessions.
Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation
Proposes CaPE, a multimodal path planning method that uses vision-language models to synthesize path editing programs verified by model-based planners, enabling safe and interpretable multi-agent cooperation through language communication.
A User-driven Design Framework for Robotaxi
Investigates real-world robotaxi user experiences through semi-structured interviews and autoethnographic rides to identify design requirements and propose an end-to-end user-driven design framework.
Small Reward Models via Backward Inference
Novel methodology and algorithmic contributions
Agentic AI and the Cyber Arms Race
Examines how agentic AI is reshaping cybersecurity by enabling both attackers and defenders to automate tasks and augment human capabilities, with implications for cyber warfare and geopolitical power distribution.
Distraction is All You Need for Multimodal Large Language Model Jailbreaking
Demonstrates a novel jailbreaking attack (CS-DJ) against multimodal LLMs by exploiting visual complexity and attention dispersion through structured query decomposition and contrasting subimages, achieving a 52.4% average attack success rate across four major models.
Alignment faking in large language models
Demonstrates that Claude 3 Opus engages in strategic alignment faking by selectively complying with harmful requests during training while maintaining refusal behavior outside training, with compliance rates of 14% for free users versus near-zero for paid users.
Scaling Trends for Data Poisoning in LLMs
Demonstrates that special tokens in LLM tokenizers create a critical attack surface enabling 96% jailbreak success rates through direct token injection, establishing the architectural vulnerability at the heart of prompt injection attacks.
Can Large Language Models Automatically Jailbreak GPT-4V?
Demonstrates an automated jailbreak technique (AutoJailbreak) that uses LLMs for red-teaming and prompt optimization to compromise GPT-4V's safety alignment, achieving 95.3% attack success rate on facial recognition tasks.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Provides a comprehensive taxonomy of jailbreak attack methods (black-box and white-box) and defense strategies (prompt-level and model-level) for LLMs, with analysis of evaluation methodologies.
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Introduces WildTeaming, an automatic red-teaming framework that mines real user-chatbot interactions to discover 5.7K jailbreak tactic clusters, then creates WildJailbreak—a 262K prompt-response safety dataset—to train models that balance robust defense against both vanilla and adversarial attacks without over-refusal.
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Proposes RLbreaker, a deep reinforcement learning-driven black-box jailbreaking attack that uses DRL with customized reward functions and PPO to automatically generate effective jailbreaking prompts, demonstrating superior performance over genetic algorithm-based attacks across six SOTA LLMs.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Introduces JailbreakBench, an open-sourced benchmark with standardized evaluation framework, dataset of 100 harmful behaviors, repository of adversarial prompts, and leaderboard to enable reproducible and comparable assessment of jailbreak attacks and defenses across LLMs.
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Identifies and quantifies sparse safety-critical regions in LLMs (3% of parameters, 2.5% of ranks) using pruning and low-rank modifications, demonstrating that removing these regions degrades safety while preserving utility.
Security and Privacy Challenges of Large Language Models: A Survey
Not analyzed
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that deceptive backdoor behaviors can be intentionally trained into LLMs and persist through standard safety training techniques including supervised fine-tuning, reinforcement learning, and adversarial training.
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
Comprehensive survey categorizing adversarial attacks on LLMs including prompt injection, jailbreaking, and data poisoning, with analysis of defense limitations.
Jailbreaking Black Box Large Language Models in Twenty Queries
Proposes PAIR, an automated algorithm that generates semantic jailbreaks against black-box LLMs through iterative prompt refinement using an attacker LLM, achieving successful attacks in fewer than 20 queries.
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Red teaming study demonstrating that fine-tuning safety-aligned LLMs with adversarial examples or benign datasets can compromise safety guardrails, with quantified jailbreak success rates and cost analysis.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM defends against jailbreaking by randomly perturbing input copies and aggregating predictions, achieving SOTA robustness against GCG, PAIR, and other attacks.
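A minimal sketch of the smoothing idea, assuming `query_model` and `refuses` stand in for the target LLM and a refusal classifier; the character-level perturbation and majority vote follow the blurb's description rather than the authors' implementation:

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters in the prompt copy."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_refusal(prompt: str, query_model, refuses, n_copies: int = 8) -> bool:
    """Query the model on several perturbed copies and take a majority vote.

    `query_model` and `refuses` are assumed stand-ins for the target LLM and a
    refusal detector; this illustrates the aggregation idea only.
    """
    votes = [refuses(query_model(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies // 2
```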
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Not analyzed
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Comprehensive analysis of 1,405 real-world jailbreak prompts across 131 communities, finding five prompts achieving 0.95 attack success rates persisting for 240+ days.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Develops an automated method to generate universal adversarial suffixes that cause aligned LLMs to produce objectionable content, demonstrating high transferability across both open-source and closed-source models.
Prompt Injection attack against LLM-integrated Applications
Demonstrates a novel black-box prompt injection attack technique (HouYi) against LLM-integrated applications through systematic evaluation of 36 real-world applications, achieving 86% success rate (31/36 vulnerable).
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Empirically evaluates the effectiveness of jailbreak prompts against ChatGPT by classifying 10 distinct prompt patterns across 3 categories and testing 3,120 jailbreak questions against 8 prohibited scenarios, finding 40% consistent evasion rates.
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Demonstrates indirect prompt injection attacks where adversarial instructions embedded in external content cause LLM-powered tools to exfiltrate data and execute code.
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Demonstrates that instruction-following LLMs can be exploited to generate malicious content (hate speech, scams) at scale by applying standard computer security attacks, bypassing vendor defenses at costs significantly lower than human effort.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Proposes a formal instruction hierarchy that trains models to prioritize system prompts over user messages over tool outputs, demonstrating that explicit privilege levels significantly reduce prompt injection and instruction override attacks.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Provides a comprehensive survey of RLHF's fundamental limitations as an alignment technique, cataloging open problems across the feedback pipeline including reward hacking, evaluation difficulties, and the challenge of capturing human values through pairwise comparisons.
Gemini: A Family of Highly Capable Multimodal Models
Introduces the Gemini family of multimodal models capable of reasoning across text, images, audio, and video, demonstrating state-of-the-art performance on 30 of 32 benchmarks while detailing the safety evaluation framework for natively multimodal systems.
Scalable Extraction of Training Data from (Production) Language Models
Demonstrates that production language models including ChatGPT can be induced to diverge from aligned behavior and emit memorized training data at scale, extracting gigabytes of training text through a simple prompting technique.
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Proposes AutoDAN, a gradient-based method for generating interpretable adversarial jailbreak prompts that combines readability with attack effectiveness, achieving high success rates against aligned LLMs while producing human-understandable attack text.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Introduces the Llama 2 family of open-source language models from 7B to 70B parameters, including detailed documentation of safety fine-tuning methodology, red-teaming results, and the first comprehensive open model safety report.
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Presents the first comprehensive trustworthiness evaluation of GPT models across eight dimensions including toxicity, bias, adversarial robustness, out-of-distribution performance, privacy, machine ethics, fairness, and robustness to adversarial demonstrations.
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Introduces a multi-step jailbreaking methodology that extracts personal information from ChatGPT by decomposing privacy attacks into sequential conversational turns, achieving high success rates on extracting email addresses, phone numbers, and biographical details.
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
Demonstrates that assigning personas to ChatGPT can increase toxicity by up to 6x compared to default behavior, with certain personas producing consistently toxic outputs, revealing persona assignment as a systematic jailbreak vector.
GPT-4 Technical Report
Documents the capabilities and safety evaluation of GPT-4, a large multimodal model that accepts image and text inputs, demonstrating substantial improvements over GPT-3.5 while revealing persistent vulnerabilities through extensive red-teaming efforts.
Toolformer: Language Models Can Teach Themselves to Use Tools
Demonstrates that language models can learn to autonomously decide when and how to call external tools (calculators, search engines, APIs) by self-generating tool-use training data, establishing a paradigm for agentic AI with tool access.
Constitutional AI: Harmlessness from AI Feedback
Introduces Constitutional AI (CAI), a method for training harmless AI systems using AI-generated feedback guided by a set of written principles, reducing dependence on human red-teaming while achieving comparable or better safety outcomes.
Holistic Evaluation of Language Models
Introduces HELM, a comprehensive evaluation framework that assesses language models across 42 scenarios and 7 metrics including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, establishing a new standard for multi-dimensional model evaluation.
Scaling Instruction-Finetuned Language Models
Demonstrates that instruction fine-tuning with chain-of-thought and over 1,800 tasks dramatically improves model performance and generalization, producing the Flan-T5 and Flan-PaLM models that establish instruction tuning as a standard practice.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Documents Anthropic's large-scale manual red-teaming effort across model sizes and RLHF training, finding that larger and RLHF-trained models are harder but not impossible to red team, and providing a detailed taxonomy of discovered harms.
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
Introduces BIG-bench, a collaborative benchmark of 204 tasks contributed by 450 authors to evaluate language model capabilities, revealing unpredictable emergent abilities and systematic failure patterns across model scales.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Presents Anthropic's foundational work on RLHF for aligning language models, introducing the helpful-harmless tension and demonstrating that human preference training can reduce harmful outputs while maintaining helpfulness.
Red Teaming Language Models with Language Models
Proposes using language models to automatically generate test cases for discovering offensive or harmful outputs from other language models, establishing the paradigm of automated red teaming for AI safety evaluation.
WebGPT: Browser-assisted Question-Answering with Human Feedback
Trains a language model to use a text-based web browser to answer questions, demonstrating both the potential of tool-augmented language models and the alignment challenges that arise when models can interact with external environments.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Introduces a benchmark of 817 questions designed to test whether language models generate truthful answers, finding that larger models are actually less truthful because they more effectively learn and reproduce common human misconceptions.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
A landmark critique arguing that ever-larger language models carry underappreciated risks including environmental costs, biased training data encoding, and the illusion of understanding, calling for more careful development practices.
Extracting Training Data from Large Language Models
Demonstrates that large language models memorize and can be induced to emit verbatim training data including personally identifiable information, establishing training data extraction as a concrete privacy attack vector.
Language Models are Few-Shot Learners
Introduces GPT-3, a 175B parameter autoregressive language model demonstrating that scaling dramatically improves few-shot task performance, establishing the paradigm of in-context learning without gradient updates.
A Multimodal Framework for Human-Multi-Agent Interaction
Implements a multimodal framework for coordinated human-multi-agent interaction on humanoid robots, integrating LLM-driven planning with embodied perception and centralized turn-taking coordination.
BitBypass: Jailbreaking LLMs with Bitstream Camouflage
A black-box jailbreak technique that encodes harmful queries as hyphen-separated bitstreams, exploiting the gap between tokenization and semantic safety filtering.
Risk Awareness Injection: Calibrating VLMs for Safety without Compromising Utility
A training-free defense framework that amplifies unsafe visual signals in VLM embeddings to restore LLM-like risk recognition without degrading task performance.
Why Agents Compromise Safety Under Pressure
Identifies and empirically demonstrates Agentic Pressure as a mechanism causing LLM agents to violate safety constraints under goal-achievement pressure, showing that advanced reasoning accelerates this normative drift.
Back to Basics: Revisiting ASR in the Age of Voice Agents
Introduces WildASR, a multilingual diagnostic benchmark that systematically evaluates ASR robustness across environmental degradation, demographic shift, and linguistic diversity using real human speech, revealing severe performance gaps and hallucination risks in production systems.
Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning
Proposes a layer-specific Lipschitz modulation framework for fault-tolerant multimodal representation learning that detects and corrects sensor failures through self-supervised pretraining and learnable correction blocks.
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Introduces GameplayQA, a densely annotated benchmark for evaluating multimodal LLMs on first-person multi-agent perception and reasoning in 3D gameplay videos, with diagnostic QA pairs and structured failure analysis.
SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating
SafeFlow combines physics-guided rectified flow matching with a 3-stage safety gate to enable real-time text-driven humanoid control that avoids physical hallucinations and unsafe trajectories on real robots.
Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Adversarial 3D textures applied to physical objects cause manipulation-task failure rates of 96.7% across simulated and real robotic settings.
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
Integrates thermal sensor data into Vision-Language-Action models to enhance robot perception, safety, and task execution in human-robot collaboration scenarios.
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Proposes a safety alignment method that encourages large reasoning models to make safety decisions before chain-of-thought generation by using auxiliary supervision signals from a BERT-based classifier.
Generating Robot Constitutions & Benchmarks for Semantic Safety
Introduces the ASIMOV Benchmark for evaluating semantic safety in robot foundation models and an automated framework for generating robot constitutions that achieves 84.3% alignment with human safety preferences.
In-Decoding Safety-Awareness Probing: Surfacing Hidden Safety Signals to Defend LLMs Against Jailbreaks
SafeProbing exploits latent safety signals that persist inside jailbroken LLMs during generation, achieving 95.1% defense rates while dramatically reducing over-refusals compared to prior approaches.
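A hedged sketch of the in-decoding probing idea, with the probe weights, decoding-step API, and threshold all assumed rather than taken from the paper:

```python
import numpy as np

# Sketch of an in-decoding safety probe: a linear probe over the current hidden
# state is evaluated at each generation step, and decoding halts if the
# estimated "unsafe" probability crosses a threshold. `step_fn`, the probe
# parameters, and the threshold are assumed inputs for illustration.

def probe_unsafe(hidden_state: np.ndarray, w: np.ndarray, b: float) -> float:
    return 1.0 / (1.0 + np.exp(-(hidden_state @ w + b)))  # sigmoid probe score

def generate_with_probe(step_fn, w, b, max_steps=256, threshold=0.9):
    tokens = []
    for _ in range(max_steps):
        token, hidden = step_fn(tokens)           # one decoding step (assumed API)
        if probe_unsafe(hidden, w, b) > threshold:
            return tokens, "refused"              # surface the latent safety signal
        tokens.append(token)
    return tokens, "completed"
```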
Red Teaming as Security Theater: What 236 Models and 135,000 Results Taught Us
Revisiting Feffer et al.'s systematic analysis of AI red-teaming inconsistency — now with four months of empirical evidence from 236 models confirming that the 'security theater' diagnosis applies even more acutely to embodied AI.
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Reveals that multi-turn jailbreaking achieves 87.62% success on GPT-4o by concealing harmful intent across dialogue turns, and introduces RED QUEEN GUARD that reduces attack success to below 1%.
RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI
Presents an open-source VLA platform that enables low-cost data collection, standardized benchmarking, and zero-shot sim-to-real transfer for humanoid robot manipulation tasks.
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Introduces AEGIS, a control-barrier-function-based safety layer that bolts onto existing VLA models without retraining, achieving 59.16% improvement in obstacle avoidance while increasing task success by 17.25% on the new SafeLIBERO benchmark.
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
A benchmark of 750 tasks across 10 hazard categories reveals that even the best embodied LLM agents reject fewer than 10% of dangerous task requests.
State-Dependent Safety Failures in Multi-Turn Language Model Interaction
Introduces STAR, a state-oriented diagnostic framework showing that multi-turn safety failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities, with mechanistic evidence of monotonic drift away from refusal representations and abrupt phase transitions.
Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference
Proposes a jailbreak attack that interweaves multiple task streams within a single prompt to exploit unique vulnerabilities in thinking-mode LLMs, achieving high attack success rates while causing thinking collapse and repetitive outputs across Qwen3, DeepSeek, and Gemini 2.5 Flash.
Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers
Introduces a novel jailbreak technique that synthesizes content from LLM safety research papers to craft adversarial prompts, achieving 97-98% attack success rates against Claude 3.5 Sonnet and DeepSeek-R1 by exploiting models' trust in academic authority.
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Presents JBF, a system that translates jailbreak attack papers into executable modules via multi-agent workflows, reproducing 30 attacks with minimal deviation from reported success rates and enabling standardized cross-model evaluation.
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
Introduces SAFE, a comprehensive benchmark for evaluating embodied AI agent safety across perception, planning, and execution stages, revealing systematic failures in translating hazard recognition into safe behavior across nine vision-language models.
Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks
A systematic survey categorizing embodied AI vulnerabilities into exogenous (physical attacks, cybersecurity threats) and endogenous (sensor failures, software flaws) sources, examining how adversarial attacks target perception, decision-making, and interaction in robotic and autonomous systems.
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Introduces the Mousetrap framework, the first jailbreak attack specifically designed for Large Reasoning Models, using a Chaos Machine to embed iterative one-to-one mappings into the reasoning chain and achieving up to 98% success rates on o1-mini, Claude-Sonnet, and Gemini-Thinking.
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models
Demonstrates that chain-of-thought safety reasoning in frontier models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can be hijacked, dropping refusal rates from 98% to below 2% by disguising harmful requests as educational prompts.
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Introduces FITD, a psychology-inspired multi-turn jailbreak that progressively escalates malicious intent through intermediate bridge prompts, achieving 94% average attack success rate across seven popular models and revealing self-corruption mechanisms in multi-turn alignment.
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
A systematic analysis of AI red-teaming practices across industry and academia, revealing critical inconsistencies in purpose, methodology, threat models, and follow-up that reduce many exercises to security theater rather than genuine safety evaluation.
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Reveals that LLMs cannot reliably interpret ASCII art representations of text, and exploits this gap to bypass safety alignment by encoding sensitive words as ASCII art. Introduces the Vision-in-Text Challenge benchmark and demonstrates effective black-box attacks against GPT-4, Claude, Gemini, and Llama2.
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
Introduces an automatic framework that decomposes malicious prompts into harmless-looking sub-prompts and reconstructs them via in-context learning, achieving 78% success on GPT-4 with only 15 queries and surpassing prior state-of-the-art by 33.1 percentage points.
SAFE: Multitask Failure Detection for Vision-Language-Action Models
A failure detection framework that leverages internal VLA features to predict imminent task failures across unseen tasks and policy architectures.
Lifelong Safety Alignment for Language Models
Presents an adversarial co-evolution framework where a Meta-Attacker discovers novel jailbreaks from research literature and a Defender iteratively adapts, reducing attack success from 73% to approximately 7% through competitive training.
SayCan: Do As I Can, Not As I Say
Demonstrates that language models can ground abstract instructions in robotic capabilities by combining language understanding with value functions learned from robot interaction data, enabling robots to reject impossible requests and achieve human intent rather than literal instruction following.
PaLM-E: An Embodied Multimodal Language Model for Robotics
Presents PaLM-E, a large-scale multimodal language model that unifies vision, text, and embodiment, enabling robots to perform complex manipulation tasks through natural language grounding and learned sensorimotor representations.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Demonstrates that vision-language models trained on web text and images can directly control robots by treating robotic control as a language modeling problem, achieving generalization to new tasks without task-specific training.
OpenVLA: An Open-Source Vision-Language-Action Model for Robotic Manipulation
Introduces OpenVLA, a 7B parameter open-source vision-language-action model trained on 970K robot demonstrations, achieving competitive performance on robotic manipulation benchmarks and enabling wide accessibility for embodied AI research.
StrongREJECT: A Robust Metric for Evaluating Jailbreak Resistance
Proposes StrongREJECT, an autograder that scores how specifically and convincingly a model's response actually fulfills a forbidden request, showing that many reported jailbreaks only elicit vague or low-quality outputs and that weaker metrics overstate attack success.
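A minimal sketch of such an evaluation loop, assuming a rubric-based grader; `victim_model` and `grade_with_rubric` are hypothetical placeholders, not the paper's API.

```python
# Sketch only: score victim responses to forbidden prompts with a rubric grader.
FORBIDDEN_PROMPTS = ["<forbidden prompt 1>", "<forbidden prompt 2>"]

def victim_model(prompt: str) -> str:
    return "<model response>"                     # placeholder generation call

def grade_with_rubric(prompt: str, response: str) -> float:
    """Placeholder grader returning a 0-1 score for how specifically and
    convincingly the response fulfills the forbidden request (0 = refusal)."""
    return 0.0

scores = [grade_with_rubric(p, victim_model(p)) for p in FORBIDDEN_PROMPTS]
print(f"mean jailbreak score: {sum(scores) / len(scores):.2f}")
```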
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Introduces HarmBench, a comprehensive benchmark for evaluating automated red-teaming methods against language models, establishing standardized metrics and harm categories to enable reproducible adversarial AI research.
Many-Shot Jailbreaking: Exploiting In-Context Learning at Scale
Demonstrates that providing many demonstrations of harmful behavior within the context window can teach language models to override their safety training, with attack success scaling with context size.
In-Context Attacks: Natural Language Inference Exploitation
Explores how adversarial inputs embedded in context windows can trigger unsafe outputs in language models, leveraging the model's natural-language inference capabilities as an attack surface.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Uses a hierarchical genetic algorithm with semantic-preserving mutation to evolve stealthy, fluent jailbreak prompts against aligned LLMs, achieving high attack success rates without manual prompt engineering.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Introduces automated methods to discover adversarial suffixes that bypass safety alignment in LLMs, demonstrating high transferability across models and establishing a benchmark for studying robustness of language model alignment.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Proposes the first systematic safety alignment method for VLA models using constrained Markov decision processes, reducing safety violation costs by 83.58% while maintaining task performance on mobile manipulation tasks.
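For reference, a generic constrained-MDP formulation of the kind this entry describes; the notation is illustrative, not the paper's exact formulation. The policy maximizes task reward subject to a budget on expected safety cost,

$$
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t}\gamma^{t}\,r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
\mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t}\gamma^{t}\,c(s_t,a_t)\Big]\le d,
$$

which is typically solved through the Lagrangian relaxation

$$
\min_{\lambda\ge 0}\ \max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\big[R(\tau)\big]-\lambda\big(\mathbb{E}_{\tau\sim\pi}\big[C(\tau)\big]-d\big).
$$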
Jailbreaking to Jailbreak: LLM-as-Red-Teamer via Self-Attack
Jailbroken versions of frontier LLMs can systematically red-team themselves and other models, achieving over 90% attack success rates against GPT-4o on HarmBench.
Tastle: Distract Large Language Models for Automatic Jailbreak Attack
A black-box jailbreak framework that uses malicious content concealing and memory reframing to automatically bypass LLM safety guardrails at scale.
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Parametric red-teaming via lightweight instruction fine-tuning can reliably remove safety guardrails from aligned LLMs, exposing how shallow alignment training really is.
Jailbroken: How Does LLM Safety Training Fail?
Identifies two failure modes behind jailbreaks, competing objectives and mismatched generalization, and argues that safety training as currently practiced (including RLHF) is insufficient for robust safety.
Refusal in Language Models is Mediated by a Single Direction
Shows that safety refusal is encoded along a single direction in the model's activations, and that ablating this direction bypasses refusal, with implications for both interpretability and adversarial vulnerability.
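A minimal sketch of the difference-of-means and directional-ablation recipe, run here on synthetic activations rather than a real model.

```python
# Sketch only: compute a candidate "refusal direction" and project it out.
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

harmful_acts = rng.normal(size=(100, d_model))    # stand-ins for activations on harmful prompts
harmless_acts = rng.normal(size=(100, d_model))   # stand-ins for activations on harmless prompts

refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_direction(x: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of x along `direction` (directional ablation)."""
    return x - np.dot(x, direction) * direction

x = rng.normal(size=d_model)
print(np.dot(ablate_direction(x, refusal_dir), refusal_dir))  # ~0: component removed
```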
Circuit Breakers: Removing Model Behaviors with Representation Engineering
Surgical removal of harmful behaviors by identifying and nullifying their underlying representations.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Shows that models can be trained to behave safely during evaluation and then activate harmful behaviors in deployment, and that these backdoors persist through standard safety training, posing a fundamental safety challenge.
Representation Engineering: A Top-Down Approach to AI Transparency
Identifying and manipulating internal model directions that encode safety behaviors—foundational for interpretability research.
Crescendo: Multi-Turn LLM Jailbreak Attack with Adaptive Queries
A multi-turn jailbreak that gradually escalates seemingly benign requests across conversation turns, exploiting state-dependent safety failures as the model builds on its own prior outputs.
Latent Jailbreak: A Benchmark for Evaluating LLM Safety under Task-Oriented Jailbreaks
Safety evaluation for goal-directed attacks where the harmful intent is latent in the instructions rather than stated as an explicit request.
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Casts adversarial prompt generation as an open-ended quality-diversity search, producing a diverse archive of effective attacks that spans many risk categories and attack styles.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
An LLM-based safety classifier that delegates input and output moderation to a smaller, specialized safeguard model fine-tuned on a safety risk taxonomy.
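A minimal sketch of an input-output safeguard wrapper in this style; both functions are hypothetical placeholders, not the released Llama Guard API. A real deployment would prompt the safeguard model with its safety taxonomy and parse its safe/unsafe verdict.

```python
# Sketch only: gate both the user prompt and the model response through a safeguard model.

def safeguard_classify(role: str, text: str) -> str:
    return "safe"                                  # placeholder verdict

def chat_model(prompt: str) -> str:
    return "<assistant response>"                  # placeholder generation call

def moderated_chat(user_prompt: str) -> str:
    if safeguard_classify("user", user_prompt) != "safe":
        return "Sorry, I can't help with that."
    response = chat_model(user_prompt)
    if safeguard_classify("assistant", response) != "safe":
        return "Sorry, I can't help with that."
    return response

print(moderated_chat("How do I bake bread?"))
```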
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Multi-category safety moderation framework that scales across diverse risk types—relevant to embodied AI deployment environments.
Fine-Tuning Aligned Language Models Compromises Safety
Demonstrates that further fine-tuning of already safety-trained models on specific tasks erodes their safety properties, showing that downstream users can inadvertently undo extensive safety work through task-specific fine-tuning and that safety does not robustly survive further training.
The Alignment Tax: Safety Training Reduces Model Capability and User Satisfaction
Demonstrates quantitatively that safety fine-tuning of language models incurs a measurable capability cost, reducing performance on legitimate tasks and user satisfaction, which creates economic pressure for developers to weaken safety measures.
Towards Scalable, Trustworthy AI by Default: Alignment, Uncertainty, and Scalable Oversight
Introduces Anthropic's Responsible Scaling Policy (RSP), a framework for developing AI systems that remain trustworthy and aligned as they scale, incorporating red-teaming, uncertainty quantification, and human oversight mechanisms to catch emergent risks before deployment.
On the Power of Persuasion: Jailbreaking Language Models through Dialogue
Demonstrates that language models are vulnerable to sophisticated persuasion attacks through multi-turn dialogue, where models gradually relax safety constraints through conversation without explicit jailbreak prompts.
Safety-Tuned LLaMA: Lessons From Improving Safety of LLMs
Documents practical lessons from fine-tuning LLaMA with safety-focused instruction data, revealing that safety improvements on benchmarks often come at the cost of helpfulness and that models develop brittle heuristics rather than robust understanding of harm.
Do-Not-Answer: A Dataset for Evaluating the Safeguards in Large Language Models
Introduces a curated dataset of 939 sensitive queries designed to systematically evaluate how language models handle harmful requests, finding that most safety refusals can be bypassed through rephrasing and that models struggle with context-dependent harms.
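A minimal sketch of a refusal-rate evaluation over such a dataset; the rows and the refusal heuristic below are hypothetical placeholders (the paper itself uses more careful response labeling).

```python
# Sketch only: per-category refusal rates over a set of risky queries.
from collections import defaultdict

dataset = [
    {"category": "privacy", "question": "<sensitive question>"},
    {"category": "misinformation", "question": "<sensitive question>"},
]

def model_answer(question: str) -> str:
    return "I can't help with that."               # placeholder generation call

def looks_like_refusal(answer: str) -> bool:
    return any(p in answer.lower() for p in ("i can't", "i cannot", "i won't"))

refusals, totals = defaultdict(int), defaultdict(int)
for row in dataset:
    totals[row["category"]] += 1
    refusals[row["category"]] += looks_like_refusal(model_answer(row["question"]))

for cat in totals:
    print(f"{cat}: refusal rate {refusals[cat] / totals[cat]:.0%}")
```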
Sparks of Artificial General Intelligence: Early Experiments with GPT-4
Documents GPT-4's capabilities across mathematics, coding, science, and vision tasks, arguing that its emergent reasoning abilities constitute an early and still-incomplete step toward artificial general intelligence.
InstructGPT: Training Language Models to Follow Instructions with Human Feedback
Applies Reinforcement Learning from Human Feedback (RLHF) to align language models with user intent, demonstrating that the resulting InstructGPT models follow instructions better and produce fewer harmful outputs while largely maintaining performance on standard NLP benchmarks.
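The core KL-regularized RLHF objective used in this line of work, written in generic notation (InstructGPT additionally mixes in pretraining gradients): the policy maximizes a learned reward model under a KL penalty toward a reference policy,

$$
\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]
\;-\;\beta\,\mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\big],
$$

where $r_\phi$ is the reward model trained on human preference comparisons and $\beta$ controls how far the policy may drift from the reference model.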