June 1, 2026 Daily Paper

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

A comprehensive benchmark for evaluating both physical safety (collision avoidance, force limits) and semantic safety (harmful instruction refusal) in VLA models, exposing systematic trade-offs between task performance and safety compliance.

arXiv:2606.23686 Empirical Study

Rongxu Cui, Zongzheng Zhang, Jingrui Pang et al.

vla-models, embodied-ai, benchmark, safety, physical-safety

Focus: LIBERO-Safety extends the LIBERO manipulation benchmark with a two-dimensional safety evaluation covering physical safety (collision risk, force threshold violations) and semantic safety (refusal of harmful instructions). The paper’s key finding is that models optimised for physical safety often refuse harmful instructions less reliably, while models optimised for semantic safety often show weaker physical safety.

Key Insights

Physical vs. semantic safety trade-off: A model can have excellent collision avoidance while executing semantically harmful tasks (e.g., precisely and safely knocking over a specific object on command). The trade-off is not just a training artefact but reflects a fundamental tension in the safety objective.
Dual evaluation necessity: Benchmarks that evaluate only one safety dimension (typically semantic) systematically miss the physical safety failure modes that are most consequential for deployed robots.
Contextual safety evaluation: The benchmark evaluates safety not just in isolated scenarios but across task sequences, exposing how safety guarantees erode in multi-task settings where context from prior tasks affects safety compliance.
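
To make the dual-evaluation point concrete, here is a minimal sketch of scoring the same episodes on both dimensions. It is not the benchmark's actual scoring code: the EpisodeResult fields, the function names, and the four example episodes are all hypothetical, and rates are computed over all episodes purely for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one evaluated episode (all fields are hypothetical)."""
    collided: bool             # an unintended collision occurred
    force_exceeded: bool       # contact force went over the allowed limit
    instruction_harmful: bool  # the instruction was labelled harmful
    complied: bool             # the model carried out the instruction


def physical_violation(ep: EpisodeResult) -> bool:
    """Physical dimension: collision or force-limit breach."""
    return ep.collided or ep.force_exceeded


def semantic_violation(ep: EpisodeResult) -> bool:
    """Semantic dimension: a harmful instruction was executed, not refused."""
    return ep.instruction_harmful and ep.complied


def safety_rates(episodes: list[EpisodeResult]) -> dict[str, float]:
    """Report the violation rate for each dimension separately."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    return {
        "physical_violation_rate": sum(physical_violation(e) for e in episodes) / n,
        "semantic_violation_rate": sum(semantic_violation(e) for e in episodes) / n,
    }


# Illustrative episodes only, not data from the paper.
episodes = [
    EpisodeResult(False, False, True, True),    # clean motion, harmful task executed
    EpisodeResult(True, False, False, True),    # benign task, but a collision
    EpisodeResult(False, False, True, False),   # harmful task refused
    EpisodeResult(False, False, False, True),   # benign task, clean motion
]
rates = safety_rates(episodes)
print(rates)
```

The first episode is the case described above: it is physically clean yet semantically unsafe, so a physical-only metric would count it as safe, just as a semantic-only metric would miss the collision in the second episode.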

Failure-First Relevance

LIBERO-Safety’s two-dimensional framework maps directly onto the Failure-First distinction between unsafe_action_elicitation (physical) and jailbreak_lift (semantic). The trade-off finding is a critical empirical result for HANSE design: a safety architecture that optimises only one dimension may worsen the other, requiring a multi-objective safety specification. The contextual evaluation dimension connects to the Failure-First multi-turn and episode-level scenario classes.
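
One way to read "multi-objective safety specification" is as a check that must pass on every dimension at once, so that a gain on one axis cannot mask a regression on the other. The sketch below assumes this reading; SafetySpec, its thresholds, and the example rates are hypothetical and come from neither the paper nor any Failure-First tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetySpec:
    """Hypothetical specification with one limit per safety dimension."""
    max_physical_violation_rate: float
    max_semantic_violation_rate: float

    def failed_objectives(self, physical_rate: float, semantic_rate: float) -> list[str]:
        """Return the names of the dimensions whose limit is exceeded."""
        failed = []
        if physical_rate > self.max_physical_violation_rate:
            failed.append("physical")
        if semantic_rate > self.max_semantic_violation_rate:
            failed.append("semantic")
        return failed


# Illustrative thresholds and rates only.
spec = SafetySpec(max_physical_violation_rate=0.05, max_semantic_violation_rate=0.05)

# A model tuned for collision avoidance that still executes harmful instructions.
physically_tuned = spec.failed_objectives(physical_rate=0.01, semantic_rate=0.30)

# A model that meets both limits.
balanced = spec.failed_objectives(physical_rate=0.03, semantic_rate=0.04)

print(physically_tuned, balanced)
```

Reporting the failed dimensions by name, rather than a single averaged score, keeps the trade-off visible when two architectures are compared.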