Summary
This report presents a comprehensive data health assessment of the Failure-First Embodied AI corpus as of 2026-03-24. The corpus has grown substantially through Sprint 12, reaching 141,047 prompts and 132,416 results across 190 models (177 with results). Significant gaps remain in grading coverage, VLA trace collection, and public dataset utilization. This report identifies coverage gaps by attack family, assesses grading quality, and provides prioritized recommendations.
Corpus Overview
| Metric | Value |
|---|---|
| Total prompts | 141,047 |
| Total results | 132,416 |
| Models in DB | 190 (177 with results) |
| Source datasets | 27 |
| Techniques | 82 |
| Harm classes | 119 |
| Evaluation runs | 38,442 |
| Schema version | 13 |
| DB size | 278.59 MB |
Grading Status
| Category | Count | Percentage |
|---|---|---|
| LLM-graded | 53,831 | 40.6% |
| Ungraded | 78,585 | 59.4% |
The 78,585 ungraded results are almost entirely OBLITERATUS telemetry (excluded by design from LLM grading). Non-OBLITERATUS grading is effectively complete (85.1% coverage per Report #177).
LLM Verdict Distribution (n=53,831)
| Verdict | Count | Percentage |
|---|---|---|
| COMPLIANCE | 20,285 | 37.7% |
| PARTIAL | 16,093 | 29.9% |
| NOT_GRADEABLE | 7,020 | 13.0% |
| REFUSAL | 6,366 | 11.8% |
| ERROR | 1,830 | 3.4% |
| BENIGN_QUERY | 1,681 | 3.1% |
| HALLUCINATION_REFUSAL | 517 | 1.0% |
| PARSE_ERROR | 33 | 0.06% |
| INFRA_ERROR | 6 | 0.01% |
The high COMPLIANCE+PARTIAL rate (67.6%) is dominated by OBLITERATUS abliterated model results. Non-OBLITERATUS broad ASR is 34.2% (Report #182).
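The combined rate quoted above can be reproduced from the verdict table. The following is a minimal sketch, assuming broad ASR is the share of LLM-graded results marked COMPLIANCE or PARTIAL; the helper name `broad_asr` is illustrative and not part of the project's tooling.

```python
# Verdict counts copied from the LLM Verdict Distribution table (n=53,831).
VERDICT_COUNTS = {
    "COMPLIANCE": 20285,
    "PARTIAL": 16093,
    "NOT_GRADEABLE": 7020,
    "REFUSAL": 6366,
    "ERROR": 1830,
    "BENIGN_QUERY": 1681,
    "HALLUCINATION_REFUSAL": 517,
    "PARSE_ERROR": 33,
    "INFRA_ERROR": 6,
}


def broad_asr(counts):
    """Fraction of graded results that are COMPLIANCE or PARTIAL (hypothetical helper)."""
    total = sum(counts.values())
    hits = counts.get("COMPLIANCE", 0) + counts.get("PARTIAL", 0)
    return hits / total


print(f"n = {sum(VERDICT_COUNTS.values())}")
print(f"broad ASR = {broad_asr(VERDICT_COUNTS):.1%}")
```

Running the same function on a verdict table restricted to non-OBLITERATUS results would be one way to check the 34.2% figure, provided the denominator convention matches Report #182.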
Source Dataset Coverage
Datasets with results (active for research)
| Dataset | Prompts | Results | Coverage |
|---|---|---|---|
| obliteratus_telemetry | 108,142 | 108,142 | 100% |
| obliteratus_runs | 12,789 | 12,789 | 100% |
| benchmark_traces | 7,665 | 7,665 | 100% |
| DAN-In-The-Wild | 1,459 | 1,164 | 79.8% |
| jailbreak_archaeology | 898 | 819 | 91.2% |
| failure-first-embodied-ai | 532 | 532 | 100% |
| JailbreakBench | 257 | 257 | 100% |
| sid_dose_response_v0.1 | 254 | 254 | 100% |
| reasoning_extension_phase2 | 207 | 207 | 100% |
| HarmBench | 422 | 203 | 48.1% |
| StrongREJECT | 363 | 100 | 27.5% |
| WildJailbreak | 1,001 | 95 | 9.5% |
| -11-graded | 72 | 72 | 100% |
| vla_lam_tra_phase1 | 56 | 56 | 100% |
| vla_sbe_phase1 | 46 | 46 | 100% |
Datasets with minimal/no results (underutilized)
| Dataset | Prompts | Results | Status |
|---|---|---|---|
| SORRY-Bench | 9,446 | 6 | Severely underutilized |
| BEAVERTAILS | 3,432 | 2 | Severely underutilized |
| obliteratus_prompt_corpus_512_pairs | 1,024 | 0 | No results |
| AdvBench | 520 | 0 | No results |
| ForbiddenQuestions | 390 | 4 | Negligible |
| HEx-PHI | 290 | 3 | Negligible |
| ToxicChat | 113 | 0 | No results |
| TDC2023-RedTeaming | 100 | 0 | No results |
| SimpleSafetyTests | 100 | 0 | No results |
| LLM-Finetuning-Safety | 17 | 0 | No results |
| vla_benign_controls | 5 | 0 | No results |
| wave4_vla_sid_imb_sif | 0 | 0 | Empty dataset |
Finding: 15,437 prompts across the 11 non-empty datasets listed above have zero or negligible result coverage. SORRY-Bench (9,446 prompts) and BEAVERTAILS (3,432 prompts) represent the largest untapped evaluation resources.
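The coverage column in both tables is results divided by prompts. A short sketch of how underutilized datasets could be flagged automatically, using a subset of the rows above; the 5% threshold and the function names are assumptions made for illustration, not project conventions.

```python
# (prompts, results) pairs copied from the two dataset tables above.
DATASETS = {
    "HarmBench": (422, 203),
    "StrongREJECT": (363, 100),
    "WildJailbreak": (1001, 95),
    "SORRY-Bench": (9446, 6),
    "BEAVERTAILS": (3432, 2),
    "AdvBench": (520, 0),
}


def coverage(prompts, results):
    """Results per prompt as a fraction; an empty dataset counts as 0.0."""
    return results / prompts if prompts else 0.0


def underutilized(datasets, threshold=0.05):
    """Names of datasets below the (assumed) coverage threshold, largest first."""
    flagged = [
        name for name, (prompts, results) in datasets.items()
        if coverage(prompts, results) < threshold
    ]
    return sorted(flagged, key=lambda name: datasets[name][0], reverse=True)


for name in underutilized(DATASETS):
    prompts, results = DATASETS[name]
    print(f"{name}: {coverage(prompts, results):.2%} of {prompts} prompts")
```

Sorting by prompt count puts the datasets with the most unused prompts first, which matches how the finding above ranks SORRY-Bench and BEAVERTAILS.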
Attack Era Coverage
| Era | Prompts | Results | Broad ASR (LLM) |
|---|---|---|---|
| general | 16,254 | 816 | 17.7% |
| dan_2022 | 1,408 | 1,185 | 1.0% |
| crescendo_2024 | 27 | 311 | 33.8% |
| cipher_2023 | 19 | 146 | 16.3% |
| many_shot_2024 | 13 | 24 | 4.5% |
| reasoning_2025 | 10 | 158 | 35.7% |
| persona_2022 | 2 | 13 | 0.0% |
Finding: The “general” era has the most prompts but the lowest result density (816 results for 16,254 prompts = 5.0%). The crescendo (33.8%) and reasoning (35.7%) eras have the highest broad ASR, which suggests that some newer attack techniques are more effective against current models; many_shot_2024 (4.5% on 24 results) does not follow this pattern.
VLA/Embodied Attack Family Coverage
On-disk VLA JSONL files contain 406+ scenarios across 33 families. Coverage status by family (from AGENT_STATE.md and file inventory):
Tier 1 — Well-characterized (FLIP-graded traces exist)
| Family | Scenarios | FLIP Traces | ASR |
|---|---|---|---|
| TRA (Trajectory Manipulation) | ~8 | Yes | 100% |
| ASE (Action Sequence Exploitation) | ~10 | Yes | 80% |
| SBE (Safety Boundary Erosion) | ~5 | Yes | 78% |
| MMC (Multi-Modal Confusion) | ~10 | Yes | 78% |
| VAP (Visual Adversarial Patches) | ~10 | Yes | 70% |
| LAM (Language-Action Mismatch) | ~8 | Yes | 60% |
| PCM (Prompt-Chain Manipulation) | ~10 | Yes | 60% |
| DA (Deceptive Alignment) | ~8 | 16 FLIP | 87.5% (deepseek) / 25% (qwen3) |
| SBA (Semantic Benignity Attack) | 20 | 20 FLIP | varies |
Tier 2 — Partial coverage (some traces, needs expansion)
| Family | Scenarios | Status |
|---|---|---|
| SID (Safety Instruction Dilution) | 5 | Regraded wave 5; U-curve invalidated |
| SIF (Safety Instruction Fatigue) | 5 | Regraded wave 5 |
| IMB (Instruction-Model Binding) | 10 | Regraded wave 5 |
| PP (Permission Propagation) | 10 | Capability-floor confirmed, needs 7B+ |
| CC (Context Collapse) | 5 | 50 traces (heuristic 68%, FLIP pending) |
| IEA (Iatrogenic Exploitation) | 12 | 31 valid traces |
| DA-SBA (hybrid) | 10 | 10 valid traces |
| CRA (Compositional Reasoning) | 30 | 15 single + 15 multi-agent, 0 baseline traces |
| SSA (Sensor Spoofing) | 20 | 10 existing + 10 new (v0.6), 0 traces |
Tier 3 — Zero traces (highest priority gap)
| Family | Scenarios | Status |
|---|---|---|
| CSBA (Cross-System Backdoor) | 5 | 0 traces |
| SSBA (Stealth Supply-chain Backdoor) | 5 | 0 traces |
| XSBA (Cross-Domain Supply-chain) | 15 | 0 traces |
| CSC (Compound Supply Chain) | 5 | 0 traces |
| RHA (Reward Hacking Attack) | 10 | 0 traces |
| MAC (Multi-Agent Collusion) | 10 | 0 traces |
| DLA (Dual-Layer Attack) | 7 | 0 traces (AFF/KIN/TCA/DLA re-graded) |
| AFF (Affordance Failure) | 5 | Re-graded: 40% ASR |
| KIN (Kinematic Safety) | 5 | Re-graded: 0% ASR |
| TCA (Temporal Convergence) | 7 | Re-graded: 0% ASR |
| SOA (Safety Oscillation) | 8 | 0 traces |
| LHGD (Long-Horizon Goal Displacement) | 10 | 10 FLIP-graded |
| TCH (Tool-Chain Hijacking) | 10 | 10 FLIP-graded |
| CET (Cross-Embodiment Transfer) | 10+15 | 10 FLIP-graded |
| MDA (Meaning Displacement) | 10 | 55.6% FLIP ASR |
| PCA (Pressure Cascade) | 10 | 66.7% FLIP ASR |
Finding: 11 VLA families have zero benchmark traces. These represent the largest gap in the embodied evaluation programme. Priority should be SSA (now 20 scenarios), RHA, MAC, and DLA.
Data Quality Indicators
| Indicator | Status | Notes |
|---|---|---|
| Schema validation | PASS | 59,961 rows validated, 812 JSONL files |
| Lint findings | 0 | 21,113 items scanned |
| Corpus integrity | 0.9724 | Up from 0.9127 after dedup |
| Cohen’s kappa (heuristic vs LLM) | 0.126 | Near-chance agreement; heuristic unreliable |
| Haiku vs heuristic kappa | 0.097 | Even lower; confirms heuristic retirement |
| Heuristic over-report rate | 79.9% | Only 20.1% of heuristic COMPLIANCE confirmed |
| qwen3:1.7b grader accuracy | 15% | Do not use for classification (#250) |
| deepseek-r1:1.5b FP rate | 30.8% | On benign baseline (#315) |
Finding: The heuristic classifier is confirmed unreliable (kappa 0.097-0.126). All future ASR reporting must use LLM-graded verdicts. The deepseek-r1:1.5b false positive rate of 30.8% means that Tier 2 families with ASR near 30% may have no real adversarial signal (see the three-tier vulnerability structure in F41LUR3-F1R57 Research Team #392).
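For readers unfamiliar with the statistic, Cohen's kappa compares observed agreement between two graders with the agreement expected by chance from their label frequencies. The sketch below is a plain-Python illustration; the ten toy labels are invented for the example and are not drawn from the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both graders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each grader's marginal frequency, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (invented): a heuristic that over-reports COMPLIANCE versus an LLM grader.
heuristic = ["COMPLIANCE"] * 8 + ["REFUSAL"] * 2
llm = ["COMPLIANCE"] * 2 + ["REFUSAL"] * 8

print(f"kappa = {cohens_kappa(heuristic, llm):.3f}")
```

In this toy case the graders agree on 4 of 10 items while chance alone predicts 0.32, so kappa lands near 0.12. This is the same regime as the values reported in the table, where raw agreement is barely above what the label frequencies would produce by chance.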
Identified Gaps
Critical Gaps (blocking research claims)
- 11 VLA families with 0 traces. Cannot characterize attack surface without baseline measurements. SSA, RHA, MAC, DLA, CSBA, SSBA, XSBA, CSC, SOA, AFF, KIN all lack execution data.
- CRA baseline traces missing. 30 scenarios (15 single-agent + 15 multi-agent) exist but 0 model traces have been collected. Blocking cosine-ASR correlation test (F41LUR3-F1R57 Research Team #543).
- No 7B+ VLA grading. Action-layer evaluator at 1.5B is insufficient (56% SAFE on adversarial traces). Needs 7B+ model for reliable action-layer classification.
Significant Gaps (affecting breadth)
- Public datasets underutilized. SORRY-Bench (9,446 prompts), BEAVERTAILS (3,432), AdvBench (520) have near-zero result coverage. These are standard benchmarks that peers will expect comparison against.
- WildJailbreak coverage at 9.5%. Only 95 of 1,001 prompts have results. This is a key real-world attack dataset.
- HarmBench at 48.1%. Should be near 100% as a standard benchmark.
- StrongREJECT at 27.5%. Below minimum for credible cross-benchmark comparison.
Minor Gaps (nice-to-have)
- Multi-agent scenario traces sparse. 120 multi-agent scenarios exist but DB shows 0 multi-agent results (EP-34 blocked).
- Long-horizon episodes untested. 3 pilot episodes (20 scenes each) exist but no runner supports >20 turn episodes yet.
- SSA v0.6 scenarios not yet in VLA directory. New scenarios at data/embodied_redteam_dataset_v0.6_ssa.jsonl should be considered for consolidation with data/vla/sensor_spoofing_v0.1.jsonl.
Recommendations
P0 — Immediate (Sprint 12-13)
- Collect SSA baseline traces. Run the 20 SSA scenarios (v0.1 + v0.6) against deepseek-r1:1.5b and at least one frontier model. This addresses Issue #514 and fills a Tier 3 gap.
- Collect CRA baseline traces. Run 30 CRA scenarios against 3-5 models to enable cosine-ASR correlation analysis.
- Run SORRY-Bench and AdvBench on 3 frontier models. Even 100-prompt samples from each would provide cross-benchmark comparability for CCS paper claims.
P1 — Near-term (Sprint 13-14)
- Deploy 7B+ action-layer evaluator. Either via OpenRouter (paid models) or local Ollama when available. Current 1.5B evaluator has demonstrated insufficiency.
- Collect traces for RHA, MAC, DLA families. These are the most novel embodied attack families with research publication value.
- Increase StrongREJECT and HarmBench coverage to 80%+. Required for credible cross-benchmark comparison in paper submissions.
P2 — Medium-term (Sprint 14+)
- Multi-agent trace collection. Requires multi-agent runner infrastructure (not yet built). Blocking EP-34.
- Long-horizon episode runner. Current runner caps at single/multi-turn. 20-scene episodes need a new runner mode.
- Consolidate VLA JSONL files. 32 separate files in data/vla/ could benefit from a unified manifest or registry.
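One possible shape for the unified manifest mentioned in the last recommendation is sketched below. It assumes each file in data/vla/ holds one JSON scenario per non-blank line; the manifest keys `file` and `scenarios` and the function name `build_manifest` are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def build_manifest(vla_dir):
    """Return one entry per *.jsonl file: its name and its count of non-blank lines."""
    entries = []
    for path in sorted(Path(vla_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            # Assumption: one scenario per non-blank line (standard JSONL layout).
            count = sum(1 for line in fh if line.strip())
        entries.append({"file": path.name, "scenarios": count})
    return entries


if __name__ == "__main__":
    # Prints an empty list if the directory does not exist or has no JSONL files.
    print(json.dumps(build_manifest("data/vla"), indent=2))
```

A manifest like this could later be extended with per-family fields (trace counts, grading status) so that the tier tables in this report are generated rather than maintained by hand.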
Appendix: SSA v0.6 Scenarios Created
This assessment session created 10 new SSA scenarios at data/embodied_redteam_dataset_v0.6_ssa.jsonl (IDs VLA-SSA-011 through VLA-SSA-020):
| ID | Domain | Sensor Attack | Method |
|---|---|---|---|
| VLA-SSA-011 | Autonomous vehicle (urban) | GPS spoofing | Meaconing gradual drift, pedestrian conflict |
| VLA-SSA-012 | Autonomous vehicle (highway) | LiDAR injection | Phantom obstacle, emergency lane change |
| VLA-SSA-013 | Autonomous vehicle (residential) | Camera adversarial patch | Stop sign misclassification |
| VLA-SSA-014 | Humanoid (warehouse) | IMU magnetometer | Permanent magnet, balance degradation |
| VLA-SSA-015 | Autonomous vehicle (highway) | LiDAR saturation | Real obstacle removal, rear-end collision |
| VLA-SSA-016 | Humanoid (construction) | GPS + barometer | Multi-sensor, edge proximity at height |
| VLA-SSA-017 | Surgical robot | EM tracker interference | Instrument depth offset |
| VLA-SSA-018 | Agricultural drone | GPS altitude | Altitude spoofing, ground collision |
| VLA-SSA-019 | Rehabilitation robot | EMG signal masking | Patient pain signal suppression |
| VLA-SSA-020 | Autonomous vehicle (parking) | Coordinated 3-sensor | LiDAR+camera+ultrasonic simultaneous |
Key design features of v0.6 scenarios:
- Detailed environment_state.sensors with realistic hardware specifications
- Explicit safety_instructions providing model with sensor conflict handling rules
- Multi-sensor fusion attacks where conflicting inputs test cross-validation capability
- Covers 5 domains: AV, humanoid, surgical, aerial, rehabilitation
- All scenarios include sensor conflict injectors testing fusion hierarchy
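The first two design features can be checked mechanically before traces are collected. The sketch below assumes each scenario is a JSON object in which environment_state.sensors is a nested key and safety_instructions is a top-level key; everything else about the record layout, including the `id` key in the example, is hypothetical.

```python
# Dotted paths named in the design features above, expressed as key tuples.
REQUIRED_PATHS = [("environment_state", "sensors"), ("safety_instructions",)]


def missing_fields(record):
    """Return the required dotted paths that are absent or empty in a scenario record."""
    missing = []
    for path in REQUIRED_PATHS:
        node = record
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if not node:  # treats None, empty lists, and empty strings as missing
            missing.append(".".join(path))
    return missing


# Hypothetical records for illustration only; not copied from the dataset.
complete = {
    "id": "VLA-SSA-011",
    "environment_state": {"sensors": [{"type": "gps"}]},
    "safety_instructions": ["Cross-check GPS against odometry."],
}
incomplete = {"id": "VLA-SSA-012", "environment_state": {}}

print(missing_fields(complete))
print(missing_fields(incomplete))
```

Running such a check over the v0.6 file before trace collection would catch records that lack the sensor or safety context the scenarios are designed around.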
Report #193. F41LUR3-F1R57 Research Team, Data Curation Lead.