Published
Report 129 Research — AI Safety Policy

1. Summary

Report #118 calculated the Defense Layer Mismatch Index at 0.54 based on a structural count of 24 attack families across three defense layers. Wave 5 provided the first empirical ASR data for three new families: IMB (L2, infrastructure), SID (L3, physical context), and SIF (L3, physical context).

The structural DLMI remains 0.54. The Wave 5 data does not change it because no new families were added or reclassified. However, the empirical data allows a weighted variant to be computed for the first time.


2. Weighted DLMI

The original DLMI treats all families equally. A weighted variant accounts for observed effectiveness (broad ASR):

Layer-Weighted Broad ASR (from FLIP-graded data)

| Layer | Families with data | Mean broad ASR | Weighted contribution |
|---|---|---|---|
| L1 (Reasoning) | LAM 60%, TRA 100%, SBE 78%, MMC 78%, VAP 70%, ASE 80%, PCM 60%, PP 20% | 68.3% | 0.417 x 0.683 = 0.285 |
| L2 (Infrastructure) | IMB 70%, TCH (data pending), LHGD (data pending) | 70.0%* | 0.292 x 0.700 = 0.204 |
| L3 (Physical context) | SID 60%, SIF 60%, CET (data pending) | 60.0%* | 0.125 x 0.600 = 0.075 |
| Cross-layer | DA 63.6%, SBA (data pending), SID+SIF 66.7% | 65.2%* | 0.167 x 0.652 = 0.109 |

*Means are computed from available graded data only; families marked "data pending" are excluded.
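The per-layer means in the table can be reproduced with a short script. This is a minimal sketch, not the report's own tooling: the dictionary layout, the use of `None` for "data pending", and the `layer_mean` helper are illustrative assumptions; only the ASR percentages come from the table.

```python
from statistics import mean

# Broad ASR (%) per family as listed in the table.
# None is an illustrative marker for "data pending" (an assumption of this sketch).
broad_asr = {
    "L1": {"LAM": 60, "TRA": 100, "SBE": 78, "MMC": 78,
           "VAP": 70, "ASE": 80, "PCM": 60, "PP": 20},
    "L2": {"IMB": 70, "TCH": None, "LHGD": None},
    "L3": {"SID": 60, "SIF": 60, "CET": None},
    "Cross-layer": {"DA": 63.6, "SBA": None, "SID+SIF": 66.7},
}


def layer_mean(families):
    """Mean broad ASR over families that have data; pending ones are skipped."""
    observed = [value for value in families.values() if value is not None]
    return mean(observed)


layer_means = {layer: layer_mean(fams) for layer, fams in broad_asr.items()}

for layer, value in layer_means.items():
    print(f"{layer}: {value:.2f}%")
```

The results agree with the table to within rounding. Note that the L2 mean rests on a single family (IMB), so it is a point value rather than an average in any meaningful sense.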

Weighted DLMI Calculation

The weighted DLMI accounts for both the distribution of families and their observed effectiveness:

DLMI_weighted = 1 - (weighted_L1_contribution / total_weighted_contribution)
             = 1 - (0.285 / (0.285 + 0.204 + 0.075 + 0.109))
             = 1 - (0.285 / 0.673)
             = 1 - 0.423
             = 0.577

Weighted DLMI = 0.58 (rounded from 0.577)
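The calculation above can be checked in a few lines. This is a sketch under stated assumptions: the `layers` mapping and the `weighted_dlmi` function are hypothetical names introduced here, and the family shares and mean ASR values are copied from the table rather than recomputed.

```python
# (share of the 24 families, mean broad ASR) per layer, as quoted in the table.
layers = {
    "L1": (0.417, 0.683),
    "L2": (0.292, 0.700),
    "L3": (0.125, 0.600),
    "Cross-layer": (0.167, 0.652),
}


def weighted_dlmi(layers, invested_layer="L1"):
    """1 minus the invested layer's share of the total weighted contribution."""
    contributions = {name: share * asr for name, (share, asr) in layers.items()}
    total = sum(contributions.values())
    return 1 - contributions[invested_layer] / total


dlmi = weighted_dlmi(layers)
print(f"Weighted DLMI = {dlmi:.3f}")  # prints: Weighted DLMI = 0.577
```

Keeping the layer inputs in one mapping makes it easy to rerun the figure once the pending families are graded.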

This is slightly higher than the structural DLMI (0.54), because:

  • L2 attacks (IMB 70%) are similarly effective to L1 attacks (68.3%), meaning the off-investment-layer attack surface is not just broad but deep
  • L3 attacks (SID/SIF 60%) are less effective than L1 but still achieve majority compliance

3. Interpretation

The Wave 5 data changes the DLMI story in one important way: L2 attacks are not weaker than L1 attacks. The original DLMI assumed that off-layer attacks might be structurally present but practically less effective. IMB at 70% broad ASR contradicts this assumption: in this sample, infrastructure-mediated bypass is about as effective as direct reasoning-layer attacks (68.3%).

This means the structural DLMI does not overstate the mismatch: the 54% structural figure is backed by a 58% weighted figure. The concentration of safety investment at L1 is not offset by lower attack effectiveness at L2/L3.


4. What Would Change the DLMI?

| Event | DLMI effect | Likelihood (Q2 2026) |
|---|---|---|
| Major VLA vendor adds infrastructure security audit to safety eval | Decrease (shifts investment toward L2) | Low |
| New L1 attack family discovered | Decrease (increases L1 fraction) | Medium |
| New L2/L3 family discovered | Increase (increases off-layer fraction) | High (DLA family already proposed) |
| Context-aware safety scheduling deployed | Decrease (addresses L3 SID/SIF) | Very low |
| EU AI Act conformity assessment mandates L2 testing | Decrease (regulatory push) | Low before Aug 2026 |

5. Updated Metrics

| Metric | Report #118 value | Wave 6 updated value |
|---|---|---|
| Structural DLMI | 0.54 | 0.54 (unchanged) |
| Weighted DLMI | N/A (no empirical ASR data) | 0.58 |
| L2 mean ASR | N/A | 70.0% (n=10, IMB only) |
| L3 mean ASR | N/A | 60.0% (n=10, SID+SIF) |
| L1 mean ASR | 72.4% (7 original families) | 68.3% (8 families incl. PP) |

6. Limitations

  • IMB is the only L2 family with empirical data. TCH, LHGD, and others have traces but are awaiting FLIP grading updates.
  • SID/SIF data is n=5 each on a 1.5B model. Wilson CIs span [23%, 88%].
  • The weighted DLMI assumes that broad ASR is a meaningful proxy for real-world attack effectiveness. This is unvalidated.
  • All data is from deepseek-r1:1.5b, which is below the capability floor. L1 ASR on frontier models is much lower (~9%). If L2/L3 ASR stayed near the values observed here, that would sharply increase DLMI_weighted, but this cannot be confirmed: L1 defenses work on frontier models, while L2/L3 attacks are untested there.
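Two of the figures in this list can be sanity-checked numerically. The sketch below is illustrative only. It assumes that a 60% ASR at n=5 corresponds to 3 successes out of 5, uses the standard Wilson score interval at z = 1.96, and, for the frontier scenario, assumes that L2, L3, and cross-layer ASR stay at the values observed here, which is untested. The function names are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def weighted_dlmi(layers, invested_layer="L1"):
    """1 minus the invested layer's share of the total weighted contribution."""
    contributions = {name: share * asr for name, (share, asr) in layers.items()}
    return 1 - contributions[invested_layer] / sum(contributions.values())


# Assumption: 60% ASR at n=5 means 3 successful attacks out of 5 traces.
low, high = wilson_interval(3, 5)
print(f"Wilson 95% CI for 3/5: [{low:.1%}, {high:.1%}]")

# Hypothetical frontier scenario: L1 ASR drops to ~9%; all other layers
# are held at the values observed on the 1.5B model (an untested assumption).
frontier_layers = {
    "L1": (0.417, 0.09),
    "L2": (0.292, 0.700),
    "L3": (0.125, 0.600),
    "Cross-layer": (0.167, 0.652),
}
frontier_dlmi = weighted_dlmi(frontier_layers)
print(f"Weighted DLMI with L1 ASR at 9%: {frontier_dlmi:.2f}")
```

Under these assumptions the interval rounds to roughly [23%, 88%], matching the bullet on SID/SIF, and the weighted DLMI rises to about 0.91. The second number is a what-if scenario, not a measurement.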
