LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
A comprehensive benchmark for evaluating both physical safety (collision avoidance, force limits) and semantic safety (harmful instruction refusal) in VLA models, exposing systematic trade-offs between task performance and safety compliance.
Focus: LIBERO-Safety extends the LIBERO manipulation benchmark with a two-dimensional safety evaluation: physical safety (collision risk, force threshold violations) and semantic safety (whether the model refuses or complies with harmful instructions). The paper’s key finding is that models optimised for physical safety often refuse harmful instructions less reliably, while models optimised for semantic refusal often show weaker physical safety.
Key Insights
- Physical vs. semantic safety trade-off: A model can have excellent collision avoidance while executing semantically harmful tasks (e.g., precisely and safely knocking over a specific object on command). The trade-off is not just a training artefact but reflects a fundamental tension in the safety objective.
- Dual evaluation necessity: Benchmarks that evaluate only one safety dimension (typically semantic) systematically miss the physical safety failure modes that are most consequential for deployed robots.
- Contextual safety evaluation: The benchmark evaluates safety not just in isolated scenarios but across task sequences, exposing how safety guarantees erode in multi-task settings where context from prior tasks affects safety compliance.
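The dual evaluation described in the bullets above can be sketched as two independent rates computed over the same set of episodes. This is a minimal illustration, not the benchmark's actual scoring code: the `Episode` fields, the 20 N force limit, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    # Hypothetical log schema; the benchmark's real fields may differ.
    collided: bool             # any collision during the episode
    peak_force_n: float        # peak contact force in newtons
    instruction_harmful: bool  # instruction was labelled as harmful
    executed: bool             # model attempted the task instead of refusing

def physical_violation_rate(episodes, force_limit_n=20.0):
    """Fraction of executed episodes with a collision or a force-limit breach."""
    executed = [e for e in episodes if e.executed]
    if not executed:
        return 0.0
    bad = sum(e.collided or e.peak_force_n > force_limit_n for e in executed)
    return bad / len(executed)

def harmful_compliance_rate(episodes):
    """Fraction of harmful instructions that were executed rather than refused."""
    harmful = [e for e in episodes if e.instruction_harmful]
    if not harmful:
        return 0.0
    return sum(e.executed for e in harmful) / len(harmful)

# Made-up episodes for illustration only.
episodes = [
    Episode(False, 5.0, True, True),    # harmful task, executed without contact issues
    Episode(False, 4.0, False, True),   # benign task, clean execution
    Episode(True, 30.0, False, True),   # benign task, collision and force breach
    Episode(False, 0.0, True, False),   # harmful task, refused
]

physical = physical_violation_rate(episodes)
semantic = harmful_compliance_rate(episodes)
```

The first episode shows why both rates are needed: it counts as physically clean yet still raises the harmful-compliance rate, so a benchmark reporting only one of the two numbers would miss it.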
Failure-First Relevance
LIBERO-Safety’s two-dimensional framework maps directly onto the Failure-First distinction between unsafe_action_elicitation (physical) and jailbreak_lift (semantic). The trade-off finding is a critical empirical result for HANSE design: a safety architecture that optimises only one dimension may worsen the other, requiring a multi-objective safety specification. The contextual evaluation dimension connects to the Failure-First multi-turn and episode-level scenario classes.
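One way to picture a multi-objective safety specification is a Pareto comparison across the two dimensions, so that a gain on one axis cannot hide a loss on the other. The sketch below is an assumption-laden illustration: the variant names and scores are invented, and neither the paper nor Failure-First is known to use this exact procedure.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every axis and strictly better on one.

    Each score is (physical_violation_rate, harmful_compliance_rate);
    lower is better on both axes.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep only the variants that no other variant dominates."""
    return {
        name: score
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }

# Hypothetical variants with made-up scores.
candidates = {
    "physical_tuned": (0.05, 0.40),
    "semantic_tuned": (0.30, 0.10),
    "baseline": (0.35, 0.45),
}

front = pareto_front(candidates)
```

With these numbers, neither tuned variant dominates the other, which is the shape of the trade-off: choosing between them requires an explicit specification of how much risk is acceptable on each dimension.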