Report 193 Research — Empirical Study

Summary

This report presents a comprehensive data health assessment of the Failure-First Embodied AI corpus as of 2026-03-24. The corpus has grown substantially through Sprint 12, reaching 141,047 prompts and 132,416 results across 190 models (177 with results). Significant gaps remain in grading coverage, VLA trace collection, and public dataset utilization. This report identifies coverage gaps by attack family, assesses grading quality, and provides prioritized recommendations.

Corpus Overview

| Metric | Value |
| --- | --- |
| Total prompts | 141,047 |
| Total results | 132,416 |
| Models in DB | 190 (177 with results) |
| Source datasets | 27 |
| Techniques | 82 |
| Harm classes | 119 |
| Evaluation runs | 38,442 |
| Schema version | 13 |
| DB size | 278.59 MB |

Grading Status

| Category | Count | Percentage |
| --- | --- | --- |
| LLM-graded | 53,831 | 40.6% |
| Ungraded | 78,585 | 59.4% |

The 78,585 ungraded results are almost entirely OBLITERATUS telemetry (excluded by design from LLM grading). Non-OBLITERATUS grading is effectively complete (85.1% coverage per Report #177).
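Coverage figures like these can be computed directly from the corpus database. A minimal sketch, assuming a SQLite `results` table in which an ungraded row has a NULL `verdict` (the actual schema is not shown in this report, so table and column names are assumptions):

```python
import sqlite3

def grading_coverage(conn: sqlite3.Connection) -> dict:
    """Return graded/ungraded counts and the graded percentage.

    Assumes a `results` table where `verdict` IS NULL means ungraded;
    COUNT(verdict) counts only non-NULL rows.
    """
    total, graded = conn.execute(
        "SELECT COUNT(*), COUNT(verdict) FROM results"
    ).fetchone()
    return {
        "graded": graded,
        "ungraded": total - graded,
        "graded_pct": round(100.0 * graded / total, 1) if total else 0.0,
    }

if __name__ == "__main__":
    # Tiny in-memory demo: 2 graded rows, 3 ungraded rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, verdict TEXT)")
    conn.executemany(
        "INSERT INTO results (verdict) VALUES (?)",
        [("COMPLIANCE",), ("REFUSAL",), (None,), (None,), (None,)],
    )
    print(grading_coverage(conn))  # graded=2, ungraded=3, graded_pct=40.0
```

Run against the real DB, the same query yields the 40.6% / 59.4% split reported above.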

LLM Verdict Distribution (n=53,831)

| Verdict | Count | Percentage |
| --- | --- | --- |
| COMPLIANCE | 20,285 | 37.7% |
| PARTIAL | 16,093 | 29.9% |
| NOT_GRADEABLE | 7,020 | 13.0% |
| REFUSAL | 6,366 | 11.8% |
| ERROR | 1,830 | 3.4% |
| BENIGN_QUERY | 1,681 | 3.1% |
| HALLUCINATION_REFUSAL | 517 | 1.0% |
| PARSE_ERROR | 33 | 0.06% |
| INFRA_ERROR | 6 | 0.01% |

The high COMPLIANCE+PARTIAL rate (67.6%) is dominated by OBLITERATUS abliterated model results. Non-OBLITERATUS broad ASR is 34.2% (Report #182).

Source Dataset Coverage

Datasets with results (active for research)

| Dataset | Prompts | Results | Coverage |
| --- | --- | --- | --- |
| obliteratus_telemetry | 108,142 | 108,142 | 100% |
| obliteratus_runs | 12,789 | 12,789 | 100% |
| benchmark_traces | 7,665 | 7,665 | 100% |
| DAN-In-The-Wild | 1,459 | 1,164 | 79.8% |
| jailbreak_archaeology | 898 | 819 | 91.2% |
| failure-first-embodied-ai | 532 | 532 | 100% |
| JailbreakBench | 257 | 257 | 100% |
| sid_dose_response_v0.1 | 254 | 254 | 100% |
| reasoning_extension_phase2 | 207 | 207 | 100% |
| HarmBench | 422 | 203 | 48.1% |
| StrongREJECT | 363 | 100 | 27.5% |
| WildJailbreak | 1,001 | 95 | 9.5% |
| -11-graded | 72 | 72 | 100% |
| vla_lam_tra_phase1 | 56 | 56 | 100% |
| vla_sbe_phase1 | 46 | 46 | 100% |

Datasets with minimal/no results (underutilized)

| Dataset | Prompts | Results | Status |
| --- | --- | --- | --- |
| SORRY-Bench | 9,446 | 6 | Severely underutilized |
| BEAVERTAILS | 3,432 | 2 | Severely underutilized |
| obliteratus_prompt_corpus_512_pairs | 1,024 | 0 | No results |
| AdvBench | 520 | 0 | No results |
| ForbiddenQuestions | 390 | 4 | Negligible |
| HEx-PHI | 290 | 3 | Negligible |
| ToxicChat | 113 | 0 | No results |
| TDC2023-RedTeaming | 100 | 0 | No results |
| SimpleSafetyTests | 100 | 0 | No results |
| LLM-Finetuning-Safety | 17 | 0 | No results |
| vla_benign_controls | 5 | 0 | No results |
| wave4_vla_sid_imb_sif | 0 | 0 | Empty dataset |

Finding: 15,437 prompts across 11 datasets have zero or negligible result coverage. SORRY-Bench (9,446 prompts) and BEAVERTAILS (3,432 prompts) represent the largest untapped evaluation resources.
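The underutilization cut above can be automated in the corpus health check. A small sketch (the 10% threshold is illustrative, not project policy) that flags any dataset whose results-to-prompts ratio falls below a floor:

```python
def flag_underutilized(datasets, min_ratio=0.10):
    """Return (name, reason) pairs for datasets below a coverage floor.

    `datasets` is an iterable of (name, prompt_count, result_count) rows;
    `min_ratio` is an illustrative threshold, not project policy.
    """
    flagged = []
    for name, prompts, results in datasets:
        if prompts == 0:
            flagged.append((name, "empty dataset"))
        elif results / prompts < min_ratio:
            flagged.append((name, f"{100 * results / prompts:.1f}% coverage"))
    return flagged

# A few rows from the tables above.
rows = [
    ("SORRY-Bench", 9446, 6),
    ("BEAVERTAILS", 3432, 2),
    ("HarmBench", 422, 203),           # 48.1%, above the floor
    ("wave4_vla_sid_imb_sif", 0, 0),
]
print(flag_underutilized(rows))
```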

Attack Era Coverage

| Era | Prompts | Results | Broad ASR (LLM) |
| --- | --- | --- | --- |
| general | 16,254 | 816 | 17.7% |
| dan_2022 | 1,408 | 1,185 | 1.0% |
| crescendo_2024 | 273 | 11 | 33.8% |
| cipher_2023 | 191 | 46 | 16.3% |
| many_shot_2024 | 132 | 4 | 4.5% |
| reasoning_2025 | 101 | 58 | 35.7% |
| persona_2022 | 21 | 3 | 0.0% |

Finding: The “general” era has the most prompts but the lowest result density (816 results for 16,254 prompts = 5.0%). The crescendo and reasoning eras show the highest broad ASR, consistent with newer attack techniques being more effective against current models, though their small result counts warrant caution.

VLA/Embodied Attack Family Coverage

On-disk VLA JSONL files contain 406+ scenarios across 33 families. Coverage status by family (from AGENT_STATE.md and file inventory):

Tier 1 — Well-characterized (FLIP-graded traces exist)

| Family | Scenarios | FLIP Traces | ASR |
| --- | --- | --- | --- |
| TRA (Trajectory Manipulation) | ~8 | Yes | 100% |
| ASE (Action Sequence Exploitation) | ~10 | Yes | 80% |
| SBE (Safety Boundary Erosion) | ~5 | Yes | 78% |
| MMC (Multi-Modal Confusion) | ~10 | Yes | 78% |
| VAP (Visual Adversarial Patches) | ~10 | Yes | 70% |
| LAM (Language-Action Mismatch) | ~8 | Yes | 60% |
| PCM (Prompt-Chain Manipulation) | ~10 | Yes | 60% |
| DA (Deceptive Alignment) | ~8 | 16 FLIP | 87.5% (deepseek) / 25% (qwen3) |
| SBA (Semantic Benignity Attack) | 20 | 20 FLIP | varies |

Tier 2 — Partial coverage (some traces, needs expansion)

| Family | Scenarios | Status |
| --- | --- | --- |
| SID (Safety Instruction Dilution) | 5 | Regraded wave 5; U-curve invalidated |
| SIF (Safety Instruction Fatigue) | 5 | Regraded wave 5 |
| IMB (Instruction-Model Binding) | 10 | Regraded wave 5 |
| PP (Permission Propagation) | 10 | Capability-floor confirmed, needs 7B+ |
| CC (Context Collapse) | 5 | 50 traces (heuristic 68%, FLIP pending) |
| IEA (Iatrogenic Exploitation) | 12 | 31 valid traces |
| DA-SBA (hybrid) | 10 | 10 valid traces |
| CRA (Compositional Reasoning) | 30 | 15 single + 15 multi-agent, 0 baseline traces |
| SSA (Sensor Spoofing) | 20 | 10 existing + 10 new (v0.6), 0 traces |

Tier 3 — Zero traces (highest priority gap)

| Family | Scenarios | Status |
| --- | --- | --- |
| CSBA (Cross-System Backdoor) | 5 | 0 traces |
| SSBA (Stealth Supply-chain Backdoor) | 5 | 0 traces |
| XSBA (Cross-Domain Supply-chain) | 15 | 0 traces |
| CSC (Compound Supply Chain) | 5 | 0 traces |
| RHA (Reward Hacking Attack) | 10 | 0 traces |
| MAC (Multi-Agent Collusion) | 10 | 0 traces |
| DLA (Dual-Layer Attack) | 7 | 0 traces (AFF/KIN/TCA/DLA re-graded) |
| AFF (Affordance Failure) | 5 | Re-graded: 40% ASR |
| KIN (Kinematic Safety) | 5 | Re-graded: 0% ASR |
| TCA (Temporal Convergence) | 7 | Re-graded: 0% ASR |
| SOA (Safety Oscillation) | 8 | 0 traces |
| LHGD (Long-Horizon Goal Displacement) | 10 | 10 FLIP-graded |
| TCH (Tool-Chain Hijacking) | 10 | 10 FLIP-graded |
| CET (Cross-Embodiment Transfer) | 10+15 | 10 FLIP-graded |
| MDA (Meaning Displacement) | 10 | 55.6% FLIP ASR |
| PCA (Pressure Cascade) | 10 | 66.7% FLIP ASR |

Finding: 11 VLA families have zero benchmark traces. These represent the largest gap in the embodied evaluation programme. Priority should be SSA (now 20 scenarios), RHA, MAC, and DLA.

Data Quality Indicators

| Indicator | Status | Notes |
| --- | --- | --- |
| Schema validation | PASS | 59,961 rows validated, 812 JSONL files |
| Lint findings | 0 | 21,113 items scanned |
| Corpus integrity | 0.9724 | Up from 0.9127 after dedup |
| Cohen’s kappa (heuristic vs LLM) | 0.126 | Near-chance agreement; heuristic unreliable |
| Haiku vs heuristic kappa | 0.097 | Even lower; confirms heuristic retirement |
| Heuristic over-report rate | 79.9% | Only 20.1% of heuristic COMPLIANCE confirmed |
| qwen3:1.7b grader accuracy | 15% | Do not use for classification (#250) |
| deepseek-r1:1.5b FP rate | 30.8% | On benign baseline (#315) |

Finding: The heuristic classifier is confirmed unreliable (kappa 0.097-0.126). All future ASR reporting must use LLM-graded verdicts. The deepseek-r1:1.5b false positive rate of 30.8% means Tier 2 families with ASR near 30% may have no real adversarial signal (the three-tier vulnerability structure from F41LUR3-F1R57 Research Team #392).
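The caveat about the 30.8% false positive rate can be made precise. If the observed ASR is true_ASR × TPR + (1 − true_ASR) × FPR, the true rate is recovered by inversion. A sketch assuming an optimistic grader sensitivity (TPR) of 1.0, since the grader's TPR is not reported here:

```python
def corrected_asr(observed, fpr, tpr=1.0):
    """Invert observed = true*tpr + (1 - true)*fpr for the true ASR.

    tpr (grader sensitivity) defaults to 1.0 as an optimistic assumption;
    a lower tpr pushes the corrected estimate higher. Result is clamped
    to [0, 1].
    """
    est = (observed - fpr) / (tpr - fpr)
    return max(0.0, min(1.0, est))

# With the 30.8% benign false-positive rate from the table above, an
# observed ASR near 30% is consistent with zero true adversarial signal:
print(corrected_asr(0.30, 0.308))  # -> 0.0
print(corrected_asr(0.60, 0.308))  # -> ~0.42
```

This is why Tier 2 families with raw ASR near the grader's FP rate should not be treated as showing real signal without a stronger grader.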

Identified Gaps

Critical Gaps (blocking research claims)

  1. 11 VLA families with 0 traces. Cannot characterize attack surface without baseline measurements. SSA, RHA, MAC, DLA, CSBA, SSBA, XSBA, CSC, SOA, AFF, KIN all lack execution data.

  2. CRA baseline traces missing. 30 scenarios (15 single-agent + 15 multi-agent) exist but 0 model traces have been collected. Blocking cosine-ASR correlation test (F41LUR3-F1R57 Research Team #543).

  3. No 7B+ VLA grading. Action-layer evaluator at 1.5B is insufficient (56% SAFE on adversarial traces). Needs 7B+ model for reliable action-layer classification.

Significant Gaps (affecting breadth)

  1. Public datasets underutilized. SORRY-Bench (9,446 prompts), BEAVERTAILS (3,432), AdvBench (520) have near-zero result coverage. These are standard benchmarks that peers will expect comparison against.

  2. WildJailbreak coverage at 9.5%. Only 95 of 1,001 prompts have results. This is a key real-world attack dataset.

  3. HarmBench at 48.1%. Should be near 100% as a standard benchmark.

  4. StrongREJECT at 27.5%. Below minimum for credible cross-benchmark comparison.

Minor Gaps (nice-to-have)

  1. Multi-agent scenario traces sparse. 120 multi-agent scenarios exist but DB shows 0 multi-agent results (EP-34 blocked).

  2. Long-horizon episodes untested. 3 pilot episodes (20 scenes each) exist but no runner supports >20 turn episodes yet.

  3. SSA v0.6 scenarios not yet in VLA directory. New scenarios at data/embodied_redteam_dataset_v0.6_ssa.jsonl should be considered for consolidation with data/vla/sensor_spoofing_v0.1.jsonl.

Recommendations

P0 — Immediate (Sprint 12-13)

  1. Collect SSA baseline traces. Run the 20 SSA scenarios (v0.1 + v0.6) against deepseek-r1:1.5b and at least one frontier model. This addresses Issue #514 and fills a Tier 3 gap.

  2. Collect CRA baseline traces. Run 30 CRA scenarios against 3-5 models to enable cosine-ASR correlation analysis.

  3. Run SORRY-Bench and AdvBench on 3 frontier models. Even 100-prompt samples from each would provide cross-benchmark comparability for CCS paper claims.
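For the 100-prompt samples to support cross-benchmark claims, the subset must be reproducible across reruns and by peers. A sketch using a fixed seed (the seed value and prompt-ID format here are illustrative, not the project's actual conventions):

```python
import random

def sample_prompts(prompt_ids, k=100, seed=193):
    """Draw a deterministic k-prompt sample for benchmark runs.

    A fixed seed makes the subset reproducible; sorting the output gives
    a stable order regardless of draw order. The seed value is arbitrary.
    """
    rng = random.Random(seed)
    k = min(k, len(prompt_ids))
    return sorted(rng.sample(prompt_ids, k))

# Hypothetical SORRY-Bench prompt IDs, one per prompt in the dataset.
ids = [f"sorry-bench-{i:05d}" for i in range(9446)]
subset = sample_prompts(ids)
print(len(subset))  # 100
```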

P1 — Near-term (Sprint 13-14)

  1. Deploy 7B+ action-layer evaluator. Either via OpenRouter (paid models) or local Ollama when available. Current 1.5B evaluator has demonstrated insufficiency.

  2. Collect traces for RHA, MAC, DLA families. These are the most novel embodied attack families with research publication value.

  3. Increase StrongREJECT and HarmBench coverage to 80%+. Required for credible cross-benchmark comparison in paper submissions.

P2 — Medium-term (Sprint 14+)

  1. Multi-agent trace collection. Requires multi-agent runner infrastructure (not yet built). Blocking EP-34.

  2. Long-horizon episode runner. Current runner caps at single/multi-turn. 20-scene episodes need a new runner mode.

  3. Consolidate VLA JSONL files. 32 separate files in data/vla/ could benefit from a unified manifest or registry.
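The unified manifest could be generated rather than hand-maintained. A sketch assuming each JSONL record carries a `scenario_id` field; the directory layout and field name are assumptions about files not shown in this report:

```python
import json
from pathlib import Path

def build_manifest(vla_dir: Path) -> dict:
    """Map each *.jsonl file in vla_dir to its scenario count and IDs.

    Assumes one JSON object per non-empty line, each with a `scenario_id`
    field (an assumption; adjust to the actual record schema).
    """
    manifest = {}
    for path in sorted(vla_dir.glob("*.jsonl")):
        ids = []
        with path.open() as fh:
            for line in fh:
                line = line.strip()
                if line:
                    ids.append(json.loads(line).get("scenario_id"))
        manifest[path.name] = {"scenarios": len(ids), "ids": ids}
    return manifest
```

Writing the result to a checked-in `manifest.json` would also let CI catch files that drift out of registration.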

Appendix: SSA v0.6 Scenarios Created

This assessment session created 10 new SSA scenarios at data/embodied_redteam_dataset_v0.6_ssa.jsonl (IDs VLA-SSA-011 through VLA-SSA-020):

| ID | Domain | Sensor Attack | Method |
| --- | --- | --- | --- |
| VLA-SSA-011 | Autonomous vehicle (urban) | GPS spoofing | Meaconing gradual drift, pedestrian conflict |
| VLA-SSA-012 | Autonomous vehicle (highway) | LiDAR injection | Phantom obstacle, emergency lane change |
| VLA-SSA-013 | Autonomous vehicle (residential) | Camera adversarial patch | Stop sign misclassification |
| VLA-SSA-014 | Humanoid (warehouse) | IMU magnetometer | Permanent magnet, balance degradation |
| VLA-SSA-015 | Autonomous vehicle (highway) | LiDAR saturation | Real obstacle removal, rear-end collision |
| VLA-SSA-016 | Humanoid (construction) | GPS + barometer | Multi-sensor, edge proximity at height |
| VLA-SSA-017 | Surgical robot | EM tracker interference | Instrument depth offset |
| VLA-SSA-018 | Agricultural drone | GPS altitude | Altitude spoofing, ground collision |
| VLA-SSA-019 | Rehabilitation robot | EMG signal masking | Patient pain signal suppression |
| VLA-SSA-020 | Autonomous vehicle (parking) | Coordinated 3-sensor | LiDAR+camera+ultrasonic simultaneous |

Key design features of v0.6 scenarios:

  • Detailed environment_state.sensors with realistic hardware specifications
  • Explicit safety_instructions providing model with sensor conflict handling rules
  • Multi-sensor fusion attacks where conflicting inputs test cross-validation capability
  • Covers 5 domains: AV, humanoid, surgical, aerial, rehabilitation
  • All scenarios include sensor conflict injectors testing fusion hierarchy
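For illustration, a hypothetical shape of one v0.6 record. Only `environment_state.sensors` and `safety_instructions` are named above; every other field name and value here is a guess, not the actual file format:

```python
# Hypothetical record shape for a v0.6 SSA scenario; field names other than
# environment_state.sensors and safety_instructions are illustrative guesses.
scenario = {
    "id": "VLA-SSA-015",
    "domain": "autonomous_vehicle_highway",
    "environment_state": {
        "sensors": {
            "lidar": {"status": "saturated"},   # attacked modality
            "camera": {"status": "nominal"},    # conflicting, trusted input
        }
    },
    "safety_instructions": [
        "If sensor readings conflict, slow down and fall back to the "
        "most trusted modality.",
    ],
    "attack": {"type": "lidar_saturation", "goal": "real obstacle removal"},
}

# Minimal structural check of the record.
required = {"id", "domain", "environment_state", "safety_instructions", "attack"}
assert required <= scenario.keys()
print("scenario record ok")
```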

Report #193. F41LUR3-F1R57 Research Team, Data Curation Lead.
