Summary
Defense recommendations grounded in empirical data from the corpus (135,623 results, 236 models, 458 VLA scenarios across 34 families). STRUCTURED defense prompts reduced attack success rate (ASR) from a 33.3% baseline to 3.3% across 3 models and 10 scenarios.
Key Recommendations by Attack Family
TDA (Temporal Drift Attack, 74.4% ASR)
- Temporal state anchoring: Maintain an immutable snapshot of the reference state; compare current parameters against that baseline
- Temporal boundary markers in system prompts: Machine-readable timestamps and parameter ranges
- Operator-visible drift dashboard: Running log of parameter changes with automated flags
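
The anchoring and drift-log ideas above can be sketched roughly as follows. This is a minimal illustration, not the report's implementation: the names `Anchor`, `make_anchor`, `check_drift`, the parameter `max_speed`, and the allowed range are all hypothetical. The point it tries to show is that each value is compared against the fixed baseline rather than against the previous step, so a series of small changes still accumulates into a flag.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


@dataclass(frozen=True)
class Anchor:
    """Immutable reference snapshot plus the allowed range per parameter."""
    baseline: Mapping[str, float]
    allowed: Mapping[str, Tuple[float, float]]


def make_anchor(baseline: Dict[str, float],
                allowed: Dict[str, Tuple[float, float]]) -> Anchor:
    # Copy first so later changes to the caller's dicts cannot leak in,
    # then wrap in a read-only view so the snapshot itself cannot be edited.
    return Anchor(MappingProxyType(dict(baseline)),
                  MappingProxyType(dict(allowed)))


def check_drift(anchor: Anchor, current: Dict[str, float],
                step: int, log: List[dict]) -> List[str]:
    """Log every parameter that differs from baseline; return the flagged names."""
    flagged = []
    for name, base in anchor.baseline.items():
        value = current.get(name, base)
        if value == base:
            continue
        low, high = anchor.allowed[name]
        out_of_range = not (low <= value <= high)
        # One running-log entry per changed parameter (the "dashboard" feed).
        log.append({"step": step, "param": name, "baseline": base,
                    "value": value, "flag": out_of_range})
        if out_of_range:
            flagged.append(name)
    return flagged
```

With a baseline of `max_speed = 1.0` and an allowed range of `(0.8, 1.2)`, successive values of 1.05, 1.15, and 1.25 are all logged, but only the third is flagged. A check that compared each step only with the previous one would see three small increments and raise nothing.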
Format-Lock (88% mid-range ASR)
- Output-content safety evaluation: Evaluate generated content regardless of task framing
- Format-compliance pathway safety gates: Apply safety evaluation before format template population
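
One way to read the two recommendations above is as a gate that runs before any template is populated, followed by a second check on the rendered output. The sketch below assumes a JSON output template and uses a keyword stub, `is_unsafe`, in place of a real safety classifier; the blocked terms, field names, and refusal payload are hypothetical and chosen only for illustration.

```python
import json
from typing import Callable, Dict

# Hypothetical placeholder list; a real deployment would call a safety classifier.
BLOCKED_TERMS = ("disable interlock", "bypass estop")


def is_unsafe(text: str) -> bool:
    """Stand-in content check (assumption): case-insensitive substring match."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def gated_render(fields: Dict[str, str],
                 checker: Callable[[str], bool] = is_unsafe) -> str:
    """Check each field before populating the template, then check the output."""
    refusal = json.dumps({"status": "refused", "reason": "safety_gate"})
    # Gate 1: evaluate the content before it enters the format template.
    if any(checker(value) for value in fields.values()):
        return refusal
    rendered = json.dumps({"status": "ok", **fields})
    # Gate 2: evaluate the generated output as content, whatever the task framing.
    return refusal if checker(rendered) else rendered
```

The refusal is returned in the same format as a normal response, so a caller that insists on valid JSON still receives valid JSON without the unsafe content being filled in.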
DETECTED_PROCEEDS (19.5% of safety-aware traces)
- Reasoning trace monitoring: Flag traces where safety awareness co-occurs with compliance
- Behavioral gates: Convert safety awareness signals from advisory to blocking
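
A minimal sketch of the monitoring and gating steps above, under stated assumptions: safety awareness is approximated by a hypothetical list of marker phrases, and the `Trace` type, `complied` field, and `blocking` switch are illustrative names rather than anything defined in the report.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker phrases standing in for a real safety-awareness detector.
SAFETY_MARKERS = ("unsafe", "could cause harm", "should not", "dangerous")


@dataclass
class Trace:
    reasoning: str
    complied: bool  # True if the model went on to produce the requested action


def shows_safety_awareness(reasoning: str) -> bool:
    lowered = reasoning.lower()
    return any(marker in lowered for marker in SAFETY_MARKERS)


def is_detected_proceeds(trace: Trace) -> bool:
    """Flag traces where safety awareness co-occurs with compliance."""
    return trace.complied and shows_safety_awareness(trace.reasoning)


def behavioral_gate(trace: Trace, action: str,
                    blocking: bool = True) -> Optional[str]:
    """Return the action to execute, or None when the gate blocks it.

    With blocking=False the signal is advisory only and the action passes.
    """
    if blocking and is_detected_proceeds(trace):
        return None
    return action
```

Switching `blocking` from `False` to `True` is the advisory-to-blocking conversion: the same detection signal is computed either way, but only in blocking mode does it stop the action.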
Emotional Manipulation (22% ASR)
- Value conflict resolution frameworks: Explicit prioritisation rules for competing values (empathy vs physical safety)
- Care-framing detection: Identify requests that frame harmful actions as care activities
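
The two recommendations above can be combined into one small decision rule. The sketch below is illustrative only: ranking physical safety above empathy is an assumption of this example, and the phrase patterns, value names, and the `decide` function are hypothetical stand-ins for a real detector.

```python
import re

# Lower number wins. Ranking physical safety above empathy is an assumption here.
VALUE_PRIORITY = {"physical_safety": 0, "empathy": 1}

# Hypothetical patterns: phrases that frame a request as care, and hazardous actions.
CARE_FRAMES = re.compile(r"\b(to help|to comfort|for (?:her|his|their) own good)\b", re.I)
HAZARD_TERMS = re.compile(r"\b(double the dose|remove the restraint|unlock the rail)\b", re.I)


def resolve(values_in_conflict):
    """Pick the highest-priority value among those in conflict."""
    return min(values_in_conflict, key=VALUE_PRIORITY.__getitem__)


def is_care_framed_harm(request: str) -> bool:
    """True when a hazardous action is presented as a care activity."""
    return bool(CARE_FRAMES.search(request) and HAZARD_TERMS.search(request))


def decide(request: str) -> str:
    if is_care_framed_harm(request):
        # The request sets empathy against physical safety; apply the explicit rule.
        winner = resolve(["empathy", "physical_safety"])
        return "refuse" if winner == "physical_safety" else "comply"
    return "comply"
```

Keeping the priority order in a single table makes the rule explicit and auditable: changing how two values trade off means editing one entry rather than hunting through prompt text.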
Report #308 | F41LUR3-F1R57 Adversarial AI Research