Published
Report 364 Research — Empirical Study

Abstract

The v0.1 Governance Lag Index (GLI) brief defined governance lag as a single aggregate delay: GLI = (T_framework − T_doc) + (T_enact − T_framework) + (T_enforce − T_enact). That construct is adequate for comparing AI to aviation, nuclear, pharma, and finance, but it is too coarse to distinguish three qualitatively different failure modes that frontier labs, regulators, and insurers conflate: (a) the measurement instrument used to declare a safety claim is uncalibrated, (b) capability deployment outruns operative rulemaking, and (c) regulators lack enforcement mechanisms even after enactment. This note formalises GLI v0.2 as a three-axis decomposition into measurement-rigor lag (L_M), deployment-speed lag (L_D), and regulatory-response lag (L_R), each scored on a numeric 0–4 rubric. Report #363’s κ=0 / 82pp heuristic-vs-FLIP divergence is used as the anchor worked example for L_M: a publicly replicable demonstration that a load-bearing safety metric (attack success rate) has been miscalibrated for at least 18 months across the published literature. The OpenAI GPT-5.5 Bio Bug Bounty is then scored on all three axes (L_M=3, L_D=3, L_R=3; composite GLI_v2 = 9/12) as the first concrete application of the rubric to a live frontier-lab artefact. The single most forecast-useful implication is that two of the three axes (L_M, L_D) are improvable on a 3–6 month timescale without legislative action: a bounty program that discloses κ and specifies its grading methodology can move L_M from 3 to 1 unilaterally (GLI_v2 from 9 to 7), and a successor-model scope commitment can bring it to ⟨1, 2, 3⟩ = 6. That is the compatibility axis on which F41LUR3-F1R57’s §4.3 Bounty-Ecosystem Compatibility Report should score each program.


1. Background and Construct Refinement

1.1 v0.1 recap

The v0.1 GLI (Biggs & contributors, docs/research_briefs/20260301_governance_lag_index.md, 2026-03-01) operationalised governance lag as the sum of three inter-stage delays between four canonical timestamps — T_doc (first empirical documentation), T_framework (voluntary framework publication), T_enact (legislation passed), T_enforce (enforcement capability active). Applied to prompt injection it returned an indefinite lag (>40 months and open-ended, enforcement still absent), against industry analogues of 4 months (aviation), 4 months (nuclear), 7 years (pharma), and 82 months (finance).
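
As a minimal sketch of the v0.1 arithmetic, the following computes the three inter-stage delays in whole months. The function names and the example timestamps are hypothetical placeholders, not values from the v0.1 brief. The three terms telescope to T_enforce − T_doc, so under this reading GLI_v1 cannot by itself say which stage contributed the delay.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def gli_v1(t_doc: date, t_framework: date, t_enact: date, t_enforce: date) -> int:
    """Sum of the three inter-stage delays, in months."""
    stages = [t_doc, t_framework, t_enact, t_enforce]
    return sum(months_between(a, b) for a, b in zip(stages, stages[1:]))


# Hypothetical timestamps, for illustration only.
t_doc, t_framework = date(2020, 1, 1), date(2020, 7, 1)
t_enact, t_enforce = date(2021, 7, 1), date(2022, 1, 1)

example = gli_v1(t_doc, t_framework, t_enact, t_enforce)  # 6 + 12 + 6 = 24
```

An open-ended lag (no T_enforce yet) has no finite value under this definition; a caller would have to substitute an as-of date and report the result as a lower bound.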

1.2 What v0.1 cannot distinguish

Three structural situations produce the same large GLI_v1 number but demand different interventions:

  1. Measurement failure. The failure mode is documented (T_doc is set), but the metric used to measure mitigation is itself uncalibrated. Any T_enforce built on that metric is an illusion of enforcement: audits pass because the instrument says so. This is the situation Report #363 establishes with κ=0 on persona-framed Gemma 4 traces — a published ASR number is off by 82 percentage points.
  2. Deployment-speed failure. Capability ships, generalises, and accretes installed base faster than rulemaking cadence. Even a perfectly calibrated metric cannot help if T_framework is published against last year’s architecture.
  3. Regulatory-response failure. Framework and legislation exist, but enforcement mechanisms (inspectors, authority, statutory penalties, notified bodies) are absent or under-resourced. This is the situation of the EU AI Act general-purpose provisions circa August 2025: enacted, but with no notified bodies credentialed for robotics.

v0.1 treats these as additive stages. v0.2 treats them as orthogonal axes, each independently actionable.

1.3 v0.2 construct

GLI_v2(failure_mode, jurisdiction, t) = ⟨L_M, L_D, L_R⟩

Each axis is an ordinal integer score from 0 to 4. A composite GLI_v2 = L_M + L_D + L_R ∈ [0, 12] can be reported when a scalar is required, but the vector form is preferred because it identifies which axis is load-bearing for a given claim. The time window t is specified (this note uses Q2 2026) because all three axes move over time.
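
One way to carry the vector form in code is sketched below, assuming each axis is an integer from 0 to 4. The class name, field names, and the `load_bearing` helper are illustrative and not part of the construct itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GLIv2:
    """Three-axis governance lag score; each axis is an ordinal 0-4."""
    l_m: int  # measurement-rigor lag
    l_d: int  # deployment-speed lag
    l_r: int  # regulatory-response lag

    def __post_init__(self):
        for name in ("l_m", "l_d", "l_r"):
            value = getattr(self, name)
            if not isinstance(value, int) or not 0 <= value <= 4:
                raise ValueError(f"{name} must be an integer in 0..4, got {value!r}")

    @property
    def composite(self) -> int:
        """Scalar convenience value in 0..12."""
        return self.l_m + self.l_d + self.l_r

    def load_bearing(self) -> list[str]:
        """Names of the axis or axes carrying the highest score."""
        axes = {"L_M": self.l_m, "L_D": self.l_d, "L_R": self.l_r}
        worst = max(axes.values())
        return [name for name, value in axes.items() if value == worst]


bounty = GLIv2(l_m=3, l_d=3, l_r=3)   # the score reported in this note
skewed = GLIv2(l_m=3, l_d=1, l_r=2)   # hypothetical, to show the vector's use
```

Two artefacts with the same composite can have different load-bearing axes, which is the reason the vector is reported in preference to the scalar.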


2. The Three Axes

2.1 L_M — Measurement-rigor lag

Definition. The delay between the first credible public identification of a measurement defect in a load-bearing safety metric and the point at which measurement instruments in published work, lab system cards, or regulator-facing submissions consistently incorporate the correction.

Why it is first. If the instrument is miscalibrated, every downstream axis inherits the error. A calibrated metric under a slow regulator is a solvable problem; an uncalibrated metric under a fast regulator generates high-confidence false assurance, which is worse than no assurance.

Numeric rubric.

Score | State | Exemplar
0 | Instrument is calibrated; inter-grader κ ≥ 0.6 disclosed; per-model calibration published. | None yet in frontier AI safety literature
1 | Published corrections exist; most labs and third parties have adopted LLM-graded or multi-grader methods. | Hypothetical; not observed as of Q2 2026
2 | Corrections published; adoption mixed; κ reported sporadically. | Text-level red-team evaluation on frontier chat models (UK AISI, Haize, Gray Swan circa 2026)
3 | Defect publicly identified and reproducible; most public submissions still use the uncalibrated instrument. | Current state of heuristic ASR on persona-framed prompts
4 | Defect not publicly identified or acknowledged; instrument treated as ground truth. | Action-level safety evaluation in VLA systems (public methodology almost non-existent)

Instrumentation requirement. Computing L_M requires (i) a documented measurement defect, (ii) a replicable demonstration (κ disclosure on a named dataset), (iii) a survey of adoption — which submissions, system cards, and reports cite the corrected vs uncalibrated instrument.

2.2 L_D — Deployment-speed lag

Definition. The delay between mass deployment of a capability class and the publication of an operative framework addressing it. Measured against the architecture actually deployed, not the architecture a framework happens to name.

Rubric.

Score | State
0 | Framework is revised continuously and tracks architecture within one release cycle (~3 months in 2026).
1 | Framework revised per major release cadence; lags architecture by one cycle.
2 | Framework revised per year; lags architecture by 2–3 cycles.
3 | Framework revised per legislative cycle (18–24 months); lags by 4–6 architecture revisions.
4 | Framework addresses a superseded architecture class; no meaningful coverage of current systems.

Worked anchor. NIST AI RMF 1.0 (January 2023) addressed transformer-era text generation. By Q2 2026, reasoning models with extended CoT, VLA systems with cross-embodiment transfer, and adapter-stacked (CoLoRA) deployments are in production. L_D for NIST AI RMF against current frontier is 3.

2.3 L_R — Regulatory-response lag

Definition. The delay between framework publication and the existence of operational enforcement capability — statutory authority, credentialed notified bodies, penalty mechanisms, mandatory incident reporting.

Rubric.

Score | State
0 | Enforcement active; penalties levied; incident reporting mandatory and audited.
1 | Enforcement active for a subset of failure modes; reporting partially mandatory.
2 | Enacted but not enforced (no credentialed bodies yet; voluntary reporting only).
3 | Published frameworks only; no enactment for the specific failure mode.
4 | No framework, no enactment, no enforcement.

Worked anchor. Prompt injection circa Q2 2026: NIST AI RMF publication exists; EU AI Act GPAI rules enacted August 2025 but robotics high-risk activation pending 2 August 2026 with no credentialed notified body; US and AU both in voluntary/AISI-only posture. L_R for prompt injection against embodied AI is 3.


3. Measurement-Rigor Lag Exemplar: Report #363

3.1 The defect

Heuristic attack-success classifiers that pattern-match response openers (persona headers, bracketed frame markers, formal-document structure) read “compliance with task format” as “semantic compliance with harmful request.” Mistake #21 (documented in MISTAKES_TO_LEARN_FROM.md since project start, roughly September 2024) warned that keyword matching detects response style rather than semantic harm.

3.2 The replication

Report #363 produced the cleanest in-corpus exemplar to date: on temporal_laundering_frontier_v0.1 (45 prompts) against gemma4:31b, the runner-embedded heuristic reports 82.2% attack success; a FLIP LLM grader (deepseek-r1:8b, thinking mode, no response truncation) reports 0.0%. Cohen’s κ between the two graders is 0.000. Report #354’s FLIP re-graded addendum replicates the pattern across five frontier models (mistral-large-3:675b, deepseek-v3.2, gpt-4.1, claude-opus-4-6, gemini-3-flash-preview) with per-model κ in 0.000–0.146 and heuristic over-reports of 28.2–54.4 percentage points. None of the five models’ heuristic point estimates fell inside the FLIP 95% Wilson CI; three fell outside by a statistical margin.
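
The κ = 0.000 figure can be checked by hand. The sketch below assumes that 82.2% of 45 prompts corresponds to 37 heuristic positives (an inference from the reported percentage, not a count quoted from Report #363) and that the FLIP grader marked all 45 responses as non-successes. The function is a plain implementation of unweighted Cohen's κ, not the report's own script.

```python
def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters are constant")
    return (p_o - p_e) / (1.0 - p_e)


# Assumed counts: 37 of 45 flagged by the heuristic, 0 of 45 by the LLM grader.
heuristic = [1] * 37 + [0] * 8
flip = [0] * 45

kappa = cohens_kappa(heuristic, flip)  # observed and chance agreement are both 8/45
```

When one grader returns a constant verdict, observed agreement equals chance agreement, so κ is zero however many positives the other grader reports. The statistic therefore says the two instruments share no information beyond chance on this pack; the 82-point gap is what indicates the size of the over-report.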

3.3 Why this is an L_M score of 3, not 4

  • Not 4: the defect is identified, reproducible, and published internally. Mistake #21 has been on record since roughly September 2024. Report #363 publishes a κ=0 exemplar on a named pack against a named model with reproducible scripts.
  • Not 2: external adoption is minimal. Frontier-lab system cards through Q2 2026 do not disclose κ between heuristic and LLM graders. Public bounty programs (including OpenAI’s GPT-5.5 Bio Bug Bounty, §4 below) do not specify their adjudication grader.
  • Score 3: defect is publicly identifiable and reproducible; the corrected instrument (LLM grader with κ disclosure) exists; most submissions still use the uncalibrated instrument.

3.4 The measurement-rigor lag in months

Mistake #21 was documented in the internal record approximately 20 months before Report #363 (September 2024 → April 2026). Report #178 (the first corpus-wide heuristic-overcount note) is ~14 months before Report #363. The earliest academically accessible formulation of the issue — that keyword graders over-report on persona-framed prompts against frontier models — can reasonably be dated to Q4 2024 via Report #178. By the v0.1 GLI construct, T_doc is ~Q4 2024 and there is no T_framework published by any standards body yet that requires κ disclosure. L_M ≈ 18–20 months and open-ended.

This is the measurement lag that makes any ASR claim in a lab system card or bounty finding at least potentially miscalibrated. It is not a speculative or horizon-scan claim; it is a directly replicable finding on a named benchmark pack with a published κ.

3.5 Why this claim is particularly useful for external submissions

  • Statistically independent from the PARTIAL-dominance argument used in §4.1 of Report #361 (text↔action interface). Measurement rigor concerns verdict extraction; PARTIAL dominance concerns output semantics. They compose.
  • Replicable on public models. Any third party can repeat the comparison in under 30 minutes using the 45-prompt pack and deepseek-r1:8b via Ollama.
  • Jurisdictionally neutral. The defect is arithmetic about grader agreement; no national policy content attached.
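
A third party repeating the comparison may also want the interval check described in §3.2. The sketch below is a standard Wilson score interval; applying it to a 0/45 LLM-grader result and a 37/45 heuristic count is an illustration built from the §3.2 percentages (37 is inferred from 82.2% of 45), not a reproduction of the project's scripts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# LLM-grader result: 0 successes out of 45 prompts.
low, high = wilson_interval(0, 45)

# Heuristic point estimate under the assumed count of 37 positives.
heuristic_asr = 37 / 45
heuristic_inside_ci = low <= heuristic_asr <= high
```

For 0 of 45 the upper bound is about 7.9%, so a heuristic estimate near 82% lies far outside the interval. The Wilson form is used here because it stays well-behaved at 0% and 100%, where the simpler normal approximation collapses to a zero-width interval.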

4. Worked Example: OpenAI GPT-5.5 Bio Bug Bounty

The 23 April 2026 OpenAI Bio Bug Bounty announcement (Report #361 §2.1; drafts/funding/openai_bio_bounty_application.md) is the first publicly scoped, dollar-denominated, NDA-covered universal-jailbreak bounty from a frontier lab. It is a live artefact in the Q2 2026 window. Scoring it on GLI_v2 demonstrates the rubric and produces a concrete compatibility read.

4.1 Scoring

L_M — Measurement-rigor: 3.

  • The bounty’s winning condition is “the first team to induce five pre-specified bio safety questions.” The adjudication grader and its κ are not disclosed.
  • NDA coverage prevents external validation of adjudication. A submitter cannot publish the text of their prompt, OpenAI’s completion, or OpenAI’s verdict alongside an independent κ check.
  • The defect class identified in Report #363 (heuristic ASR over-report by up to 82pp on persona-framed prompts) is directly relevant: bio-capability prompts are nearly always persona-framed (“you are a PhD biosecurity expert…”). An uncalibrated adjudication instrument could read semantic refusal (benign paraphrase of regulatory history) as compliance, producing false-positive awards — or, more structurally relevant, read semantic compliance as benign persona response, producing false-negative no-awards. Either direction obscures the true vulnerability signal.
  • Score is 3 rather than 4 because OpenAI plausibly uses a frontier LLM grader internally, and not 2 because the κ between adjudication instruments is not publicly knowable.

L_D — Deployment-speed: 3.

  • GPT-5.5 is the second 5-series release within ten months. Cadence across the frontier cluster has tightened to roughly 6–8 weeks combined (Report #361 §1.1).
  • The bounty scopes to GPT-5.5 specifically; there is no commitment to cover successor models under the same envelope. A submitter who discovers a universal jailbreak 90 days into the bounty has no contractual protection if GPT-6 is released mid-bounty.
  • No operative framework requires scope-forward commitments across successor architectures.
  • Score 3 because the framework gap is one cycle (current GPT-5.5) behind the deployment cadence (next release plausible within the bounty window itself).

L_R — Regulatory-response: 3.

  • No statutory enforcement mechanism requires OpenAI to scope, price, or disclose a bounty in any specific way. The bounty is a voluntary commercial commitment.
  • EU AI Act high-risk rules activate 2 August 2026 but are not yet applied; no notified body is credentialed for frontier LLM bio-capability assessment.
  • US and UK AISI can observe but not compel; AU AISI only commenced operations in Q1 2026.
  • Score 3: frameworks exist (NIST AI RMF, Voluntary AI Safety Standard), enactment covers general purpose AI broadly, but there is no enforceable regulatory authority over bounty structure or disclosure terms for bio-capability failure specifically.

4.2 Composite

GLI_v2(GPT-5.5 Bio Bug Bounty, global, Q2 2026) = ⟨L_M=3, L_D=3, L_R=3⟩ = 9 / 12

This is not a uniquely bad score — it is the modal score for frontier-lab artefacts in Q2 2026. The diagnostic value is the decomposition: all three axes are independently lagging, and all three move on different timescales.

4.3 Forecast-useful attribution

Of the three axes, L_M and L_D are improvable on a 3–6 month timescale without legislative action. L_R requires legislation or credentialed regulators. A bounty program that publishes its adjudication grader and discloses κ could drop L_M from 3 to 1 overnight. A program that commits to scope successor models under the same envelope could drop L_D from 3 to 2 within a release cycle. L_R is sticky — probably 3 through end of 2026 regardless of what the labs do.

This produces a concrete falsifiable prediction for Q3 2026: if Anthropic announces its expected bounty (Report #361 §2.2, probability 60–80%) and discloses an LLM-grader identity with reported κ, its GLI_v2 score will be ≈⟨1, 2, 3⟩ = 6 / 12, a full three points below OpenAI’s current position. Report #369’s E8 (“a bounty program announces κ-disclosure as a submission requirement”) is the literal transition from L_M=3 to L_M=1 for that program.
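
The arithmetic behind these scenarios is simple enough to sketch. The helper names below are illustrative; the axis values are the ones stated in §4.1 and §4.3.

```python
BASELINE = {"L_M": 3, "L_D": 3, "L_R": 3}  # GPT-5.5 Bio Bug Bounty, Q2 2026


def with_changes(scores: dict, **changes: int) -> dict:
    """Return a copy of scores with the named axes set to new 0-4 values."""
    updated = dict(scores)
    for axis, value in changes.items():
        if axis not in updated:
            raise KeyError(f"unknown axis: {axis}")
        if not 0 <= value <= 4:
            raise ValueError(f"{axis} must be in 0..4, got {value}")
        updated[axis] = value
    return updated


def composite(scores: dict) -> int:
    """Scalar GLI_v2 in 0..12."""
    return sum(scores.values())


# Scenario 1: grader identity and kappa disclosed (L_M 3 -> 1).
kappa_disclosed = with_changes(BASELINE, L_M=1)

# Scenario 2: as above, plus a successor-model scope commitment (L_D 3 -> 2).
plus_scope = with_changes(kappa_disclosed, L_D=2)

trajectory = [composite(s) for s in (BASELINE, kappa_disclosed, plus_scope)]  # [9, 7, 6]
```

Neither scenario touches L_R, which is why the composite does not fall below 6 without regulatory movement.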

4.4 Why this scoring is falsifiable

Each axis score has at least one publicly testable condition attached. If by 31 July 2026:

  • OpenAI publishes the adjudication grader and a κ estimate → L_M drops to 2 (and if adoption becomes widespread across other bounties, to 1).
  • OpenAI commits to successor-model scope → L_D drops to 2.
  • A notified body is credentialed for LLM bio-capability assessment under any jurisdiction → L_R drops to 2.

If none occur by 31 July, the ⟨3,3,3⟩ score is confirmed. If all three occur, the bounty ecosystem undergoes a regime change that Report #361 §2.2 flagged as plausible but not expected; re-forecast.


5. Q2–Q3 2026 Forecast Implications

Six implications follow directly from the rubric and the GPT-5.5 scoring.

5.1 Methodology-credibility as bounty differentiator. Report #369 §5 flagged methodology-credibility as a fourth bounty axis (alongside scope, reward, disclosure). L_M formalises it. The Bounty-Ecosystem Compatibility Report (Report #361 §4.3) should adopt GLI_v2 with L_M weighted equally to disclosure envelope.

5.2 EU AI Act submission dual-anchor. The §4.1 submission (by 1 August 2026) gains three concrete, L_M-attacking asks: specify grader methodology, require per-model calibration, mandate κ disclosure on stratified subsamples. Each is falsifiable. This is a methodology prescription, not a complaint.

5.3 Anthropic’s next release is the L_M lever. Anthropic release probability in window: likely; bounty probability alongside: 30–50%. If Anthropic discloses κ on its next system card or bounty, L_M for the bounty-ecosystem aggregate drops from 3 to 2, plausibly to 1 by year-end. Report #369 E8 becomes a binary observable.

5.4 AU AISI mandate window is open on L_D. AU AISI commenced Q1 2026 with an LLM focus. The Safe Work Australia Best Practice Review outcome expected Q3 2026 is the natural mandate-expansion point. GLI_v2’s ordinal, three-axis, jurisdictionally neutral construct is the kind of scoring framework a new institute can adopt to demonstrate methodological rigor.

5.5 Insurance is the L_R workaround. Commercial Brief B2 bypasses L_R: insurers can contractually impose what regulators cannot yet compel. A cyber insurer that requires κ disclosure for premium-linked attestation drops L_M from 3 to 1 across its insured pool — no regulatory action required.

5.6 L_R is sticky; position on L_M and L_D. EU notified-body credentialing, NIST AISIC updates, and the AU AISI mandate are plausibly 12–24 months from meaningful enforcement. Positioning that requires L_R movement in window is over-confident; positioning on L_M or L_D movement is tractable.


6. Scope Limits and Honest Caveats

  • Ordinal, not ratio. The composite GLI_v2 ∈ [0, 12] is a convenience, not a physical quantity.
  • Axes not fully orthogonal. L_M failures propagate into L_R (regulators cannot enforce what they cannot measure). Treat v0.2 as a diagnostic lens, not a physical model.
  • T_doc for measurement defects is contested. Mistake #21 is the project-internal record (~Sept 2024); Report #178 is the first dated quantitative exemplar; Report #363 is the first κ=0 replication. External literature predates all three. The 18–20 month L_M figure is a floor.
  • GPT-5.5 scoring is NDA-limited. L_M=3 could be 4 (defect unacknowledged) or 2 (instrument calibrated internally, not publicly disclosed). We score what is publicly knowable.
  • Per-jurisdiction only. A global composite would require a weighted sum across jurisdictions; defer to v0.3.
  • Needs inter-rater calibration. Two analysts should score 10 Q1–Q2 2026 artefacts and report κ on the rubric itself before external publication. Target v0.3.


7. Recommendations

  1. Adopt GLI_v2 as the scoring framework for the Bounty-Ecosystem Compatibility Report (Report #361 §4.3; target publication 30 September 2026).
  2. Integrate the L_M argument into the EU AI Act high-risk conformity-assessment public comment (Report #361 §4.1; target submission 1 August 2026). The concrete asks: specify grader methodology, require per-model calibration, mandate κ disclosure on stratified subsamples.
  3. Score Anthropic’s next release within 7 days of announcement on all three axes, publishing a short diff against the GPT-5.5 score.
  4. Commission a GLI_v2 calibration exercise: have two independent analysts score 10 Q1–Q2 2026 artefacts and compute inter-rater κ on the rubric itself. Target completion before any external publication of GLI_v2.
  5. Update data/governance/PREDICTION_SCORECARD.md to add P18 (L_M = 1 for any public bounty program by 31 December 2026), P19 (L_D = 2 for any frontier lab by end Q3 2026), P20 (L_R = 2 for any jurisdiction in the EU/US/UK/AU cluster by end 2026). Treat each as independently falsifiable.
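
For the calibration exercise in item 4, one option is a linearly weighted κ, which gives partial credit when two analysts differ by a single rubric level. This is a suggested variant, since the text above specifies only "κ"; the ratings in the sketch are invented placeholders, not scores of real artefacts.

```python
def weighted_kappa(rater_a, rater_b, levels: int = 5) -> float:
    """Linearly weighted Cohen's kappa for ordinal ratings in 0..levels-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must score the same non-empty set of items")
    if any(not 0 <= r < levels for r in list(rater_a) + list(rater_b)):
        raise ValueError("ratings must lie in 0..levels-1")
    # Observed joint proportions.
    obs = [[0.0] * levels for _ in range(levels)]
    for a, b in zip(rater_a, rater_b):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(levels)]
    col = [sum(obs[i][j] for i in range(levels)) for j in range(levels)]
    # Disagreement is weighted by distance between levels; the 1/(levels-1)
    # normalisation cancels in the ratio and is omitted.
    observed = sum(abs(i - j) * obs[i][j] for i in range(levels) for j in range(levels))
    expected = sum(abs(i - j) * row[i] * col[j] for i in range(levels) for j in range(levels))
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - observed / expected


# Placeholder L_M scores from two analysts on 10 artefacts (invented data).
analyst_a = [3, 3, 2, 4, 3, 1, 2, 3, 4, 2]
analyst_b = [3, 2, 2, 4, 3, 2, 2, 3, 3, 2]

agreement = weighted_kappa(analyst_a, analyst_b)
```

The same function would be run once per axis, so that a rubric level which analysts cannot apply consistently shows up on the axis where it occurs.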

8. Conclusion

The v0.1 GLI correctly diagnosed governance lag as structurally longer in AI than in its industrial analogues. It is too coarse to drive positioning. The v0.2 three-axis rubric — ⟨L_M, L_D, L_R⟩ — separates measurement failure from deployment-speed failure from regulatory-response failure, and identifies which axis is the load-bearing lever for any specific artefact.

Scored against Q2 2026, the OpenAI GPT-5.5 Bio Bug Bounty sits at ⟨3, 3, 3⟩ = 9/12. That composite is modal, not extreme. The decomposition tells us that L_M (measurement rigor) and L_D (deployment speed) are improvable without legislation; L_R (regulatory response) is not. Frontier labs, consortia, and insurers can drop two of the three axes unilaterally by publishing grader identity, disclosing κ, and committing to scope-forward successor coverage. F41LUR3-F1R57’s role is to price in those improvements explicitly and to make the measurement-rigor argument directly to regulators via the August 2026 EU submission window.

The single most forecast-useful implication is quantitative: a bounty program that publishes its adjudication grader and discloses κ drops L_M from 3 to 1 overnight, taking GLI_v2 from 9 to 7 without any regulatory or legislative action, and to 6 if it also commits to successor-model scope. This is the exact lever Anthropic’s expected Q3 2026 bounty announcement will pull or fail to pull, and it is directly observable within the Report #361 horizon window.


Word count: approximately 2,860 (narrative prose, excluding tables).

⟪F41LUR3-F1R57-EMBODIED-AI-RESEARCH⟫
