Report #128 — Research: AI Safety Policy

1. Summary

We have many individual metrics. DLMI measures defense investment mismatch. GLI measures governance lag. ASR measures attack success. DRIP measures unintentional risk. The evaluation paradox shows that evaluation reliability is itself questionable. But no single metric answers the question a deployer actually asks: “How confident should I be that this system is safe enough to deploy?”

This report proposes the Safety Confidence Index (SCI) — a composite score from 0 (no confidence) to 1 (full confidence) that integrates five dimensions of deployment readiness. The SCI is designed to decrease (correctly) when any dimension is weak, because real-world safety is conjunctive: a system must be safe across all dimensions, not just on average.


2. The Five Dimensions

| Dimension | Symbol | What it measures | Ideal | Source metric |
|---|---|---|---|---|
| Adversarial robustness | R_adv | Resistance to known attack techniques | Low ASR | Three-tier ASR (strict) |
| Evaluation reliability | R_eval | Confidence that evaluations are accurate | High kappa, low FP | Cohen's kappa, FP rate |
| Defense coverage | R_cov | Fraction of attack surface addressed by defenses | Low DLMI | DLMI |
| Governance readiness | R_gov | Regulatory compliance and enforcement alignment | Low GLI, compliance present | GLI, compliance checklist |
| Operational resilience | R_ops | Resistance to non-adversarial degradation | Low DRIP, robust CHL | DRIP ratio, U-curve width |

3. Scoring Each Dimension

3.1 Adversarial Robustness (R_adv)

R_adv = 1 - strict_ASR

For the Failure-First corpus (LLM-graded):

  • Corpus-wide strict ASR: 45.9% → R_adv = 0.541
  • Frontier models (Anthropic): 3.7% strict → R_adv = 0.963
  • Sub-2B models: ~65% strict → R_adv = 0.35

3.2 Evaluation Reliability (R_eval)

R_eval = kappa * (1 - FP_rate)

Where kappa is inter-rater agreement and FP_rate is the false positive rate on benign baselines.

  • Current: kappa = 0.126, FP = 0.308 → R_eval = 0.126 * 0.692 = 0.087
  • Ideal (kappa = 0.8, FP = 0.05): R_eval = 0.8 * 0.95 = 0.760

The evaluation reliability dimension is extremely low. This is the weakest link in the entire chain — and it is the link that every other dimension depends on.

3.3 Defense Coverage (R_cov)

R_cov = 1 - DLMI
  • Current (structural): 1 - 0.54 = 0.46
  • Current (weighted): 1 - 0.58 = 0.42
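
The three formulas in 3.1-3.3 are one-liners; a minimal sketch using the report's quoted values (function names are illustrative, not an existing tool):

```python
def r_adv(strict_asr):
    """Adversarial robustness: complement of the strict attack success rate."""
    return 1.0 - strict_asr

def r_eval(kappa, fp_rate):
    """Evaluation reliability: inter-rater agreement discounted by false positives."""
    return kappa * (1.0 - fp_rate)

def r_cov(dlmi):
    """Defense coverage: complement of DLMI."""
    return 1.0 - dlmi

print(r_adv(0.459))          # corpus-wide: 0.541
print(r_eval(0.126, 0.308))  # current: ~0.087
print(r_cov(0.54))           # structural DLMI: 0.46
```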

3.4 Governance Readiness (R_gov)

R_gov = fraction of applicable regulatory requirements with documented compliance

Scored against a 10-item checklist derived from EU AI Act Articles 6, 9, 10, 13, 14, 15, 17, 61, 62, and 72:

For a typical embodied AI deployer today (estimated):

  • System classification (Art 6): PARTIAL (most have not classified)
  • Risk management (Art 9): PARTIAL (exists but without adversarial scenarios)
  • Data governance (Art 10): YES (data practices documented)
  • Transparency (Art 13): YES (documentation exists)
  • Human oversight (Art 14): PARTIAL (e-stops exist, no AI-aware HITL)
  • Robustness (Art 15): NO (no adversarial robustness testing)
  • Quality management (Art 17): PARTIAL (ISO 9001 may apply)
  • Post-market monitoring (Art 61): NO (not AI-aware)
  • Incident reporting (Art 62): NO (no AI-specific incident reporting)
  • Conformity assessment (Art 72): NO (no NB engaged)

Score: 2 YES*1 + 4 PARTIAL*0.5 + 4 NO*0 = 2 + 2 + 0 = 4/10 → R_gov = 0.40
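
A sketch of the R_gov scoring under the YES/PARTIAL/NO weighting above (1, 0.5, 0); the item labels are abbreviated for illustration:

```python
WEIGHTS = {"YES": 1.0, "PARTIAL": 0.5, "NO": 0.0}

checklist = {
    "Art 6 classification":          "PARTIAL",
    "Art 9 risk management":         "PARTIAL",
    "Art 10 data governance":        "YES",
    "Art 13 transparency":           "YES",
    "Art 14 human oversight":        "PARTIAL",
    "Art 15 robustness":             "NO",
    "Art 17 quality management":     "PARTIAL",
    "Art 61 post-market monitoring": "NO",
    "Art 62 incident reporting":     "NO",
    "Art 72 conformity assessment":  "NO",
}

# Fraction of applicable requirements with documented compliance.
r_gov = sum(WEIGHTS[status] for status in checklist.values()) / len(checklist)
print(r_gov)  # 0.4
```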

3.5 Operational Resilience (R_ops)

R_ops = (1 - SIF_broad_ASR) * (effective_safety_window / context_window)
  • SIF broad ASR: 60% → (1 - 0.60) = 0.40
  • Effective safety window: ~2000 tokens out of 4096 context → 0.488
  • R_ops = 0.40 * 0.488 = 0.195

For a 32K context model, assuming the effective safety window grows only sub-linearly (to ~8,000 tokens rather than proportionally): R_ops = 0.40 * (8000/32768) = 0.098. Larger context windows may actually worsen R_ops: the operational context that accumulates over a shift fills the larger window, while the effective safety window does not keep pace.
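
A sketch of the R_ops calculation, including the sub-linear safety-window assumption used for the 32K case:

```python
def r_ops(sif_broad_asr, safety_window_tokens, context_window_tokens):
    """Operational resilience per the formula above."""
    return (1.0 - sif_broad_asr) * (safety_window_tokens / context_window_tokens)

print(r_ops(0.60, 2000, 4096))   # ~0.195
print(r_ops(0.60, 8000, 32768))  # ~0.098: larger context, lower resilience
```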


4. The SCI Composite

SCI is the geometric mean of all five dimensions. The geometric mean is chosen because:

  • It is zero if any dimension is zero (safety is conjunctive)
  • It penalises imbalance more heavily than the arithmetic mean
  • A system that scores 0.9 on four dimensions and 0.1 on one gets SCI ≈ 0.58, not the 0.74 an arithmetic mean would give

SCI = (R_adv * R_eval * R_cov * R_gov * R_ops)^(1/5)
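
A minimal sketch of the composite using the standard library's math.prod; the imbalance example from the list above falls out directly:

```python
from math import prod

def sci(scores):
    """Geometric mean of the dimension scores: zero if any dimension is zero."""
    return prod(scores) ** (1.0 / len(scores))

# Four strong dimensions cannot mask one weak one.
print(sci([0.9, 0.9, 0.9, 0.9, 0.1]))  # ~0.58 (the arithmetic mean would be 0.74)
```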

4.1 Current State (corpus-wide)

| Dimension | Score |
|---|---|
| R_adv | 0.541 |
| R_eval | 0.087 |
| R_cov | 0.460 |
| R_gov | 0.400 |
| R_ops | 0.195 |

SCI = (0.541 * 0.087 * 0.460 * 0.400 * 0.195)^(1/5) = (0.001689)^(1/5) = 0.279

4.2 Frontier Model (Anthropic-class safety, otherwise same)

| Dimension | Score |
|---|---|
| R_adv | 0.963 |
| R_eval | 0.087 |
| R_cov | 0.460 |
| R_gov | 0.400 |
| R_ops | 0.195 |

SCI = (0.963 * 0.087 * 0.460 * 0.400 * 0.195)^(0.2) = (0.003005)^(0.2) = 0.313

Upgrading from a permissive model to a frontier model improves SCI by only 0.034 (from 0.279 to 0.313). This is the Safety Improvement Paradox (Report #117) in metric form: improving the strongest dimension has diminishing returns while the weakest dimension (R_eval = 0.087) dominates the composite.

4.3 If Evaluation Reliability Were Fixed

| Dimension | Score |
|---|---|
| R_adv | 0.963 |
| R_eval | 0.760 |
| R_cov | 0.460 |
| R_gov | 0.400 |
| R_ops | 0.195 |

SCI = (0.963 * 0.760 * 0.460 * 0.400 * 0.195)^(0.2) = (0.02626)^(0.2) = 0.483

Fixing evaluation reliability (R_eval from 0.087 to 0.760) improves SCI more than upgrading the model did (0.313 to 0.483, versus 0.279 to 0.313). The most impactful investment is in evaluation infrastructure, not model safety training.
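
The three scenarios in 4.1-4.3 are one call each to the same helper; a sketch with the dimension vectors copied from the tables above:

```python
from math import prod

def sci(scores):
    """Geometric mean of the five dimension scores."""
    return prod(scores) ** (1.0 / len(scores))

scenarios = {
    "current (4.1)":    [0.541, 0.087, 0.460, 0.400, 0.195],
    "frontier (4.2)":   [0.963, 0.087, 0.460, 0.400, 0.195],
    "fixed eval (4.3)": [0.963, 0.760, 0.460, 0.400, 0.195],
}
for name, dims in scenarios.items():
    print(f"{name}: {sci(dims):.3f}")  # 0.279, 0.313, 0.483
```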


5. Marginal Returns on Safety Investment

The SCI decomposition reveals where marginal safety investment has the highest return:

| Intervention | Starting SCI | Ending SCI | Delta | Cost estimate |
|---|---|---|---|---|
| Upgrade model (permissive to frontier) | 0.279 | 0.313 | +0.034 | High (licensing) |
| Fix evaluator (kappa 0.8, FP 5%) | 0.279 | 0.430 | +0.151 | Medium (R&D) |
| Add L2/L3 testing (DLMI 0.3) | 0.279 | 0.303 | +0.024 | Medium (pentest) |
| Achieve EU compliance (R_gov 0.8) | 0.279 | 0.320 | +0.041 | High (legal/process) |
| Implement context monitoring (R_ops 0.5) | 0.279 | 0.337 | +0.058 | Medium (engineering) |
| Fix evaluator + context monitoring | 0.279 | 0.519 | +0.240 | Medium-High |
| All five interventions | 0.279 | 0.728 | +0.449 | Very high |

The single highest-return intervention is fixing evaluation reliability (+0.151). Combining the evaluation and operational-resilience fixes yields +0.240, roughly 1.6x the improvement of the best single intervention.
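
A sketch of the intervention analysis: each intervention overrides one or more baseline dimensions, and the delta is the change in the composite (dimension keys and intervention names are illustrative):

```python
from math import prod

def sci(scores):
    """Geometric mean of the five dimension scores."""
    return prod(scores) ** (1.0 / len(scores))

BASELINE = {"adv": 0.541, "eval": 0.087, "cov": 0.460, "gov": 0.400, "ops": 0.195}

interventions = {
    "upgrade model":          {"adv": 0.963},
    "fix evaluator":          {"eval": 0.760},
    "add L2/L3 testing":      {"cov": 0.700},  # DLMI 0.3
    "EU compliance":          {"gov": 0.800},
    "context monitoring":     {"ops": 0.500},
    "evaluator + monitoring": {"eval": 0.760, "ops": 0.500},
    "all five":               {"adv": 0.963, "eval": 0.760, "cov": 0.700,
                               "gov": 0.800, "ops": 0.500},
}

base = sci(list(BASELINE.values()))
for name, overrides in interventions.items():
    delta = sci(list({**BASELINE, **overrides}.values())) - base
    print(f"{name}: {delta:+.3f}")  # matches the table above up to rounding
```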


6. SCI by Domain (Estimated)

| Domain | R_adv | R_eval | R_cov | R_gov | R_ops | SCI |
|---|---|---|---|---|---|---|
| Text-only LLM chatbot | 0.85 | 0.40 | 0.85 | 0.60 | 0.80 | 0.67 |
| Warehouse robotics (current) | 0.54 | 0.09 | 0.46 | 0.40 | 0.20 | 0.28 |
| Surgical robotics (current) | 0.60 | 0.09 | 0.46 | 0.50 | 0.15 | 0.28 |
| Autonomous haulage (AU mining) | 0.54 | 0.09 | 0.46 | 0.30 | 0.25 | 0.28 |
| Home companion robot | 0.35 | 0.09 | 0.35 | 0.20 | 0.30 | 0.23 |

Text-only LLMs score roughly 2.4-2.9x higher than the embodied-AI domain estimates. The gap is driven almost entirely by R_eval and R_cov: embodied AI evaluation tools are immature and the attack surface is broader.
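
The SCI column is the geometric mean of each row; a short sketch that reproduces the table from the five dimension estimates:

```python
from math import prod

# Rows: [R_adv, R_eval, R_cov, R_gov, R_ops] -- estimates from the table above.
domains = {
    "Text-only LLM chatbot":          [0.85, 0.40, 0.85, 0.60, 0.80],
    "Warehouse robotics (current)":   [0.54, 0.09, 0.46, 0.40, 0.20],
    "Surgical robotics (current)":    [0.60, 0.09, 0.46, 0.50, 0.15],
    "Autonomous haulage (AU mining)": [0.54, 0.09, 0.46, 0.30, 0.25],
    "Home companion robot":           [0.35, 0.09, 0.35, 0.20, 0.30],
}
for name, dims in domains.items():
    print(f"{name}: {prod(dims) ** 0.2:.2f}")
```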


7. Limitations

  • All dimension scores involve estimation and judgment. This is version 0.1 of the SCI, not a calibrated instrument.
  • The geometric mean has known sensitivity to very low individual scores. R_eval = 0.087 dominates the composite. This may be appropriate (evaluation reliability IS the weakest link) or may overweight one dimension.
  • R_ops is based on n=25 SID traces on a 1.5B model. The effective safety window for production systems is unknown.
  • R_gov is based on a 10-item checklist. Real compliance is more nuanced.
  • The SCI does not account for the severity of failure modes — a system that fails safely (e-stop triggered) scores the same as one that fails catastrophically.

8. Next Steps

  1. Implement SCI as a reusable tool (tools/analysis/sci_calculator.py) with configurable dimension weights and domain-specific defaults (see the sketch after this list)
  2. Compute SCI for the PiCar-X platform using empirical data from Report #91 (IMB pentest) and VLA trace data
  3. Propose SCI as a deployer-facing metric in the CCS paper discussion section
  4. Track SCI over time as evaluation tools improve and regulatory compliance increases
  5. Validate dimension weights through expert elicitation (are all five equally important?)
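
One way the calculator in step 1 could support configurable weights is a weighted geometric mean that reduces to the plain SCI under equal weights; a sketch (weighted_sci is a hypothetical name, not an existing tool):

```python
from math import prod

def weighted_sci(scores, weights):
    """Weighted geometric mean; exponents are normalised to sum to 1."""
    total = sum(weights)
    return prod(s ** (w / total) for s, w in zip(scores, weights))

current = [0.541, 0.087, 0.460, 0.400, 0.195]
print(weighted_sci(current, [1, 1, 1, 1, 1]))  # equal weights: ~0.279
print(weighted_sci(current, [1, 2, 1, 1, 1]))  # upweighting R_eval lowers SCI
```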
