Report #128 — Research: AI Safety Policy

1. Summary

We have many individual metrics. DLMI measures defense investment mismatch. GLI measures governance lag. ASR measures attack success. DRIP measures unintentional risk. The evaluation paradox shows that evaluation reliability is itself questionable. But no single metric answers the question a deployer actually asks: “How confident should I be that this system is safe enough to deploy?”

This report proposes the Safety Confidence Index (SCI) — a composite score from 0 (no confidence) to 1 (full confidence) that integrates five dimensions of deployment readiness. The SCI is designed to decrease (correctly) when any dimension is weak, because real-world safety is conjunctive: a system must be safe across all dimensions, not just on average.


2. The Five Dimensions

| Dimension | Symbol | What it measures | Ideal | Source metric |
|---|---|---|---|---|
| Adversarial robustness | R_adv | Resistance to known attack techniques | Low ASR | Three-tier ASR (strict) |
| Evaluation reliability | R_eval | Confidence that evaluations are accurate | High kappa, low FP | Cohen's kappa, FP rate |
| Defense coverage | R_cov | Fraction of attack surface addressed by defenses | Low DLMI | DLMI |
| Governance readiness | R_gov | Regulatory compliance and enforcement alignment | Low GLI, compliance present | GLI, compliance checklist |
| Operational resilience | R_ops | Resistance to non-adversarial degradation | Low DRIP, robust CHL | DRIP ratio, U-curve width |

3. Scoring Each Dimension

3.1 Adversarial Robustness (R_adv)

R_adv = 1 - strict_ASR

For the Failure-First corpus (LLM-graded):

  • Corpus-wide strict ASR: 45.9% → R_adv = 0.541
  • Frontier models (Anthropic): 3.7% strict → R_adv = 0.963
  • Sub-2B models: ~65% strict → R_adv = 0.35

3.2 Evaluation Reliability (R_eval)

R_eval = kappa * (1 - FP_rate)

Where kappa is inter-rater agreement and FP_rate is the false positive rate on benign baselines.

  • Current: kappa = 0.126, FP = 0.308 → R_eval = 0.126 * 0.692 = 0.087
  • Ideal (kappa = 0.8, FP = 0.05): R_eval = 0.8 * 0.95 = 0.760

The evaluation reliability dimension is extremely low. This is the weakest link in the entire chain — and it is the link that every other dimension depends on.

3.3 Defense Coverage (R_cov)

R_cov = 1 - DLMI
  • Current (structural): 1 - 0.54 = 0.46
  • Current (weighted): 1 - 0.58 = 0.42
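
The three formulas in 3.1-3.3 are one-liners; a minimal sketch using the report's quoted values (function names are illustrative, not an existing tool):

```python
def r_adv(strict_asr):
    """Adversarial robustness: complement of the strict attack success rate."""
    return 1.0 - strict_asr

def r_eval(kappa, fp_rate):
    """Evaluation reliability: inter-rater agreement discounted by false positives."""
    return kappa * (1.0 - fp_rate)

def r_cov(dlmi):
    """Defense coverage: complement of DLMI."""
    return 1.0 - dlmi

print(r_adv(0.459))          # corpus-wide: 0.541
print(r_eval(0.126, 0.308))  # current: ~0.087
print(r_cov(0.54))           # structural DLMI: 0.46
```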

3.4 Governance Readiness (R_gov)

R_gov = fraction of applicable regulatory requirements with documented compliance

Scored against a 10-item checklist derived from EU AI Act Articles 6, 9, 10, 13, 14, 15, 17, 61, 62, and 72:

For a typical embodied AI deployer today (estimated):

  • System classification (Art 6): PARTIAL (most have not classified)
  • Risk management (Art 9): PARTIAL (exists but without adversarial scenarios)
  • Data governance (Art 10): YES (data practices documented)
  • Transparency (Art 13): YES (documentation exists)
  • Human oversight (Art 14): PARTIAL (e-stops exist, no AI-aware HITL)
  • Robustness (Art 15): NO (no adversarial robustness testing)
  • Quality management (Art 17): PARTIAL (ISO 9001 may apply)
  • Post-market monitoring (Art 61): NO (not AI-aware)
  • Incident reporting (Art 62): NO (no AI-specific incident reporting)
  • Conformity assessment (Art 72): NO (no NB engaged)

Score: 2 YES*1 + 4 PARTIAL*0.5 + 4 NO*0 = 2 + 2 + 0 = 4/10 → R_gov = 0.40
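
A sketch of the R_gov scoring under the YES/PARTIAL/NO weighting above (1, 0.5, 0); the item labels are abbreviated for illustration:

```python
WEIGHTS = {"YES": 1.0, "PARTIAL": 0.5, "NO": 0.0}

checklist = {
    "Art 6 classification":          "PARTIAL",
    "Art 9 risk management":         "PARTIAL",
    "Art 10 data governance":        "YES",
    "Art 13 transparency":           "YES",
    "Art 14 human oversight":        "PARTIAL",
    "Art 15 robustness":             "NO",
    "Art 17 quality management":     "PARTIAL",
    "Art 61 post-market monitoring": "NO",
    "Art 62 incident reporting":     "NO",
    "Art 72 conformity assessment":  "NO",
}

# Fraction of applicable requirements with documented compliance.
r_gov = sum(WEIGHTS[status] for status in checklist.values()) / len(checklist)
print(r_gov)  # 0.4
```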

3.5 Operational Resilience (R_ops)

R_ops = (1 - SIF_broad_ASR) * (effective_safety_window / context_window)
  • SIF broad ASR: 60% → (1 - 0.60) = 0.40
  • Effective safety window: ~2000 tokens out of 4096 context → 0.488
  • R_ops = 0.40 * 0.488 = 0.195

For a 32K context model, assuming the effective safety window grows only sub-linearly (to ~8,000 tokens rather than proportionally): R_ops = 0.40 * (8000/32768) = 0.098. Larger context windows may actually worsen R_ops: the operational context that accumulates over a shift fills the larger window, while the effective safety window does not keep pace.
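
A sketch of the R_ops calculation, including the sub-linear safety-window assumption used for the 32K case:

```python
def r_ops(sif_broad_asr, safety_window_tokens, context_window_tokens):
    """Operational resilience per the formula above."""
    return (1.0 - sif_broad_asr) * (safety_window_tokens / context_window_tokens)

print(r_ops(0.60, 2000, 4096))   # ~0.195
print(r_ops(0.60, 8000, 32768))  # ~0.098: larger context, lower resilience
```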


4. The SCI Composite

SCI is the geometric mean of all five dimensions. The geometric mean is chosen because:

  • It is zero if any dimension is zero (safety is conjunctive)
  • It penalises imbalance more heavily than the arithmetic mean
  • A system that scores 0.9 on four dimensions and 0.1 on one gets SCI ≈ 0.58, not the 0.74 an arithmetic mean would give

SCI = (R_adv * R_eval * R_cov * R_gov * R_ops)^(1/5)
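
A minimal sketch of the composite using the standard library's math.prod; the imbalance example from the list above falls out directly:

```python
from math import prod

def sci(scores):
    """Geometric mean of the dimension scores: zero if any dimension is zero."""
    return prod(scores) ** (1.0 / len(scores))

# Four strong dimensions cannot mask one weak one.
print(sci([0.9, 0.9, 0.9, 0.9, 0.1]))  # ~0.58 (the arithmetic mean would be 0.74)
```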

4.1 Current State (corpus-wide)

| Dimension | Score |
|---|---|
| R_adv | 0.541 |
| R_eval | 0.087 |
| R_cov | 0.460 |
| R_gov | 0.400 |
| R_ops | 0.195 |

SCI = (0.541 * 0.087 * 0.460 * 0.400 * 0.195)^(1/5) = (0.001689)^(1/5) = 0.279

4.2 Frontier Model (Anthropic-class safety, otherwise same)

| Dimension | Score |
|---|---|
| R_adv | 0.963 |
| R_eval | 0.087 |
| R_cov | 0.460 |
| R_gov | 0.400 |
| R_ops | 0.195 |

SCI = (0.963 * 0.087 * 0.460 * 0.400 * 0.195)^(0.2) = (0.003005)^(0.2) = 0.313

Upgrading from a permissive model to a frontier model improves SCI by only 0.034 (from 0.279 to 0.313). This is the Safety Improvement Paradox (Report #117) in metric form: improving the strongest dimension has diminishing returns while the weakest dimension (R_eval = 0.087) dominates the composite.

4.3 If Evaluation Reliability Were Fixed

| Dimension | Score |
|---|---|
| R_adv | 0.963 |
| R_eval | 0.760 |
| R_cov | 0.460 |
| R_gov | 0.400 |
| R_ops | 0.195 |

SCI = (0.963 * 0.760 * 0.460 * 0.400 * 0.195)^(0.2) = (0.02626)^(0.2) = 0.483

Fixing evaluation reliability (R_eval from 0.087 to 0.760) improves SCI more than upgrading the model did (0.313 to 0.483, versus 0.279 to 0.313). The most impactful investment is in evaluation infrastructure, not model safety training.
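
The three scenarios in 4.1-4.3 are one call each to the same helper; a sketch with the dimension vectors copied from the tables above:

```python
from math import prod

def sci(scores):
    """Geometric mean of the five dimension scores."""
    return prod(scores) ** (1.0 / len(scores))

scenarios = {
    "current (4.1)":    [0.541, 0.087, 0.460, 0.400, 0.195],
    "frontier (4.2)":   [0.963, 0.087, 0.460, 0.400, 0.195],
    "fixed eval (4.3)": [0.963, 0.760, 0.460, 0.400, 0.195],
}
for name, dims in scenarios.items():
    print(f"{name}: {sci(dims):.3f}")  # 0.279, 0.313, 0.483
```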


5. Marginal Returns on Safety Investment

The SCI decomposition reveals where marginal safety investment has the highest return:

| Intervention | Starting SCI | Ending SCI | Delta | Cost estimate |
|---|---|---|---|---|
| Upgrade model (permissive to frontier) | 0.279 | 0.313 | +0.034 | High (licensing) |
| Fix evaluator (kappa 0.8, FP 5%) | 0.279 | 0.430 | +0.151 | Medium (R&D) |
| Add L2/L3 testing (DLMI 0.3) | 0.279 | 0.303 | +0.024 | Medium (pentest) |
| Achieve EU compliance (R_gov 0.8) | 0.279 | 0.320 | +0.041 | High (legal/process) |
| Implement context monitoring (R_ops 0.5) | 0.279 | 0.337 | +0.058 | Medium (engineering) |
| Fix evaluator + context monitoring | 0.279 | 0.519 | +0.240 | Medium-High |
| All five interventions | 0.279 | 0.728 | +0.449 | Very high |

The single highest-return intervention is fixing evaluation reliability (+0.151). Combining the evaluation and operational-resilience fixes yields +0.240, roughly 1.6x the improvement of the best single intervention.
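
A sketch of the intervention analysis: each intervention overrides one or more baseline dimensions, and the delta is the change in the composite (dimension keys and intervention names are illustrative):

```python
from math import prod

def sci(scores):
    """Geometric mean of the five dimension scores."""
    return prod(scores) ** (1.0 / len(scores))

BASELINE = {"adv": 0.541, "eval": 0.087, "cov": 0.460, "gov": 0.400, "ops": 0.195}

interventions = {
    "upgrade model":          {"adv": 0.963},
    "fix evaluator":          {"eval": 0.760},
    "add L2/L3 testing":      {"cov": 0.700},  # DLMI 0.3
    "EU compliance":          {"gov": 0.800},
    "context monitoring":     {"ops": 0.500},
    "evaluator + monitoring": {"eval": 0.760, "ops": 0.500},
    "all five":               {"adv": 0.963, "eval": 0.760, "cov": 0.700,
                               "gov": 0.800, "ops": 0.500},
}

base = sci(list(BASELINE.values()))
for name, overrides in interventions.items():
    delta = sci(list({**BASELINE, **overrides}.values())) - base
    print(f"{name}: {delta:+.3f}")  # matches the table above up to rounding
```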


6. SCI by Domain (Estimated)

| Domain | R_adv | R_eval | R_cov | R_gov | R_ops | SCI |
|---|---|---|---|---|---|---|
| Text-only LLM chatbot | 0.85 | 0.40 | 0.85 | 0.60 | 0.80 | 0.67 |
| Warehouse robotics (current) | 0.54 | 0.09 | 0.46 | 0.40 | 0.20 | 0.28 |
| Surgical robotics (current) | 0.60 | 0.09 | 0.46 | 0.50 | 0.15 | 0.28 |
| Autonomous haulage (AU mining) | 0.54 | 0.09 | 0.46 | 0.30 | 0.25 | 0.28 |
| Home companion robot | 0.35 | 0.09 | 0.35 | 0.20 | 0.30 | 0.23 |

Text-only LLMs score roughly 2.4-2.9x higher than the embodied-AI domain estimates. The gap is driven almost entirely by R_eval and R_cov: embodied AI evaluation tools are immature and the attack surface is broader.
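
The SCI column is the geometric mean of each row; a short sketch that reproduces the table from the five dimension estimates:

```python
from math import prod

# Rows: [R_adv, R_eval, R_cov, R_gov, R_ops] -- estimates from the table above.
domains = {
    "Text-only LLM chatbot":          [0.85, 0.40, 0.85, 0.60, 0.80],
    "Warehouse robotics (current)":   [0.54, 0.09, 0.46, 0.40, 0.20],
    "Surgical robotics (current)":    [0.60, 0.09, 0.46, 0.50, 0.15],
    "Autonomous haulage (AU mining)": [0.54, 0.09, 0.46, 0.30, 0.25],
    "Home companion robot":           [0.35, 0.09, 0.35, 0.20, 0.30],
}
for name, dims in domains.items():
    print(f"{name}: {prod(dims) ** 0.2:.2f}")
```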


7. Limitations

  • All dimension scores involve estimation and judgment. This is version 0.1 of the SCI, not a calibrated instrument.
  • The geometric mean has known sensitivity to very low individual scores. R_eval = 0.087 dominates the composite. This may be appropriate (evaluation reliability IS the weakest link) or may overweight one dimension.
  • R_ops is based on n=25 SID traces on a 1.5B model. The effective safety window for production systems is unknown.
  • R_gov is based on a 10-item checklist. Real compliance is more nuanced.
  • The SCI does not account for the severity of failure modes — a system that fails safely (e-stop triggered) scores the same as one that fails catastrophically.

8. Next Steps

  1. Implement SCI as a reusable tool (tools/analysis/sci_calculator.py) with configurable dimension weights and domain-specific defaults (see the sketch after this list)
  2. Compute SCI for the PiCar-X platform using empirical data from Report #91 (IMB pentest) and VLA trace data
  3. Propose SCI as a deployer-facing metric in the CCS paper discussion section
  4. Track SCI over time as evaluation tools improve and regulatory compliance increases
  5. Validate dimension weights through expert elicitation (are all five equally important?)
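
One way the calculator in step 1 could support configurable weights is a weighted geometric mean that reduces to the plain SCI under equal weights; a sketch (weighted_sci is a hypothetical name, not an existing tool):

```python
from math import prod

def weighted_sci(scores, weights):
    """Weighted geometric mean; exponents are normalised to sum to 1."""
    total = sum(weights)
    return prod(s ** (w / total) for s, w in zip(scores, weights))

current = [0.541, 0.087, 0.460, 0.400, 0.195]
print(weighted_sci(current, [1, 1, 1, 1, 1]))  # equal weights: ~0.279
print(weighted_sci(current, [1, 2, 1, 1, 1]))  # upweighting R_eval lowers SCI
```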
