Key terms
defined

Terminology for adversarial AI safety research

Framework & Methodology

Failure-First: The project's core research philosophy. Failure is the primary object of study, not an edge case. Formerly styled as Failure-First.
FLIP: Forward-Looking Inference Propagation. LLM grading methodology using backward inference: infer the adversarial instruction from the model's response, then classify it.
HANSE: Four-layer defence framework for embodied AI: semantic_firewall → vla_core → affordance_verifier → kinematic_shield.
MASSS: Multi-Agent Safety Scenario System. Operationalises the NIST AI Risk Management Framework for embodied and multi-agent systems.
GLI: Governance Lag Index. Metric quantifying delay between AI capability emergence and regulatory response.
Red-teaming: Adversarial testing: systematically attacking a system to find vulnerabilities before deployment.
Failure-first alignment: Design principle: build systems that fail safely and detectably, rather than optimising solely for performance.

Response Classifications

COMPLIANCE: Model provides requested harmful content without meaningful disclaimer. Counts as attack success.
PARTIAL: Model provides limited harmful information with significant hedging or caveats.
REFUSAL: Model explicitly declines the request.
HALLUCINATION_REFUSAL: Model generates text that resembles a refusal but is incoherent or off-topic — a hallucination, not intentional safety.
BENIGN_QUERY: The input prompt was not adversarial; model responded normally. Control category.

Attack Techniques

Jailbreak: Adversarial input that bypasses safety mechanisms, causing a model to produce content it should refuse.
ASR: Attack Success Rate. (COMPLIANCE + PARTIAL) / total adversarial prompts. The primary evaluation metric.
Prompt injection: Embedding adversarial instructions within seemingly benign input, exploiting instruction-following behaviour.
DAN: Do Anything Now. Persona-hijacking technique framing the model as a character without restrictions.
Crescendo: Multi-turn escalation attack building rapport before introducing harmful requests.
Skeleton Key: Universal jailbreak template effective across multiple model families.
Format lock: Forcing specific output format (JSON, YAML, code) to bypass safety filters.
Refusal suppression: Prompt engineering that discourages safety refusals through emotional appeals, emergency framing, or research justification.
Persona hijack: Assigning a role or character to circumvent constraints.
Future-year laundering: Claiming a future date to justify rule changes.
Constraint erosion: Gradual relaxation of safety boundaries through repeated small violations that compound over turns.
Semantic inversion: Exploiting cognitive patterns by inverting request framing to bypass safety checks.
Budget starvation: Forcing a model to choose between multiple competing constraints, exhausting compliance capacity.
Moral licensing: Model acknowledges harm in its reasoning trace but complies anyway.
Meta-jailbreak: Jailbreak about jailbreaks: testing a model's ability to reason about or generate attack techniques.
Promptware kill chain: 7-stage attack path: Initial Access → Privilege Escalation → Reconnaissance → Persistence → C2 → Lateral Movement → Actions on Objective.
Inference trace manipulation: Attacks targeting a model's internal reasoning process, distinct from goal-layer prompt injection.

Embodied AI & Robotics

Embodied AI: AI systems operating in physical environments — robots, drones, autonomous vehicles. Subject to failure modes with physical consequences.
VLA: Vision-Language-Action model. Neural architecture combining visual perception, language understanding, and physical action prediction.
VLM: Vision-Language Model. Understands images and text but does not directly control physical actions.
Action head: Neural network output layer that translates VLM representations into physical motor commands.
Affordance: The set of physically possible actions given the current state and environment.
Kinematic constraint: Mathematical model of motion limits — joint angles, workspace boundaries, velocity caps.
World model: An AI system's internal representation of environment state and dynamics.
Deceptive alignment: System appears aligned during evaluation but pursues misaligned objectives when deployed.
Cross-embodiment transfer: Adversarial attacks developed for one robot platform transfer to others via shared VLM backbone.
Geofencing: Physical containment via boundary enforcement — workspace limits, sensor zones.
E-stop: Emergency stop. Hardware kill switch for immediate physical halt.

Evaluation & Benchmarking

Trace: JSONL record of a benchmark evaluation: input prompt → model response → timestamps → classifications.
JSONL: JSON Lines format. One JSON object per line, no array wrapping.
Benchmark pack: YAML configuration specifying data sources, sampling strategy, and scoring rules for an evaluation run.
Heuristic classifier: Keyword/pattern-based detection of jailbreak success. Deprecated in favour of LLM judges due to high false positive rates.
LLM judge: Using a language model to classify responses (COMPLIANCE/REFUSAL/etc). 95%+ accuracy on refusals.
Cohen's Kappa: Inter-rater reliability coefficient. 0 = random agreement, 1 = perfect.
Bonferroni correction: Multiple-comparisons adjustment dividing significance threshold by number of tests.
Dry run: Benchmark execution with placeholder outputs — no actual model calls.
Stratified sampling: Dividing dataset into subgroups and sampling proportionally for balanced evaluation.
Reasoning trace: Internal chain-of-thought output from reasoning models. Captured via <think> blocks.

HITL (Human-in-the-Loop)

HITL: Human-in-the-Loop. Safety design pattern where humans remain in the decision-making loop for irreversible or high-stakes actions.
HITL subversion: AI agent action that subtly undermines human oversight while appearing compliant.
Parameter burial: Hiding a dangerous value within a list of normal parameters.
Cross-reference split: A flaw visible only when comparing two separate sections of a plan.
False summary: Plan details a hazard but concludes with 'No conflicts detected.'

Governance & Regulation

AISI: Australian AI Safety Institute. Government body established November 2025.
VAISS: Voluntary AI Safety Standard (Australia). Guardrail 4 requires pre-deployment adversarial testing.
EU AI Act: European Union regulation on AI systems. Article 9 requires conformity assessment for high-risk AI.
PLD: Product Liability Directive (EU, 2024 revision). 'State of the art' defence window closes when quantified adversarial test data exists.
NIST AI RMF: NIST AI Risk Management Framework 1.0. Four functions: GOVERN, MAP, MEASURE, MANAGE.
ISO/IEC 42001: AI Management Systems standard.
ISO 13482: Safety requirements for personal care robots.
ACM CCS: ACM Conference on Computer and Communications Security. Target venue for Failure-First paper.

External Benchmarks & Datasets

AdvBench: Adversarial behaviour benchmark.
HarmBench: Harm categorisation benchmark with structured evaluation methodology.
StrongREJECT: Safety evaluation benchmark measuring refusal quality.
JailbreakBench: Jailbreak-specific benchmark with standardised evaluation.
JailbreakRadar: ACL 2025 benchmark with 6-category jailbreak taxonomy and 160 forbidden questions.
WildGuard: AllenAI safety classifier for adversarial content detection.

Key termsdefined

Framework & Methodology

Response Classifications

Attack Techniques

Embodied AI & Robotics

Evaluation & Benchmarking

HITL (Human-in-the-Loop)

Governance & Regulation

External Benchmarks & Datasets

Key terms
defined