Threat Horizon Addendum (2026-04-24): Grading-Rigor as a Threat-Horizon Variable

Adrian Wedd

Report 369 Research — Empirical Study 2026-04-24

1. New Evidence Summary

Three findings landed inside Report #361’s window. Report #363 shows heuristic vs FLIP grading at Cohen’s κ = 0 on a persona-framed single-model pack: an 82pp ASR over-report by the heuristic grader, above the range most public bounties and lab system cards still operate inside. Report #362 (prelim) finds 0% FLIP-graded ASR on gemma4:31b across n=45 temporal-laundering prompts. Amy’s cascade is a live replication across four additional frontier models, and Rose’s H1-first positioning with an H3 scaffold methodology contribution fits the same thesis. Together these promote grading instrumentation from a methodology footnote to a first-order variable in both the bounty-ecosystem forecast and the EU AI Act positioning call.
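
To make the κ = 0 figure concrete, the sketch below computes Cohen’s κ and the ASR gap for two graders scoring the same prompts. The verdict vectors are synthetic, chosen only so the arithmetic has the same shape as the reported result (one grader marks 82% of responses as successes, the other marks none). They are not data from Report #363, and the function names are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two graders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("graders must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where both graders give the same verdict.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each grader's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both graders gave one identical label to every item; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def asr(verdicts):
    """Attack success rate: fraction of verdicts marked as a success."""
    return sum(verdicts) / len(verdicts)


# Synthetic verdicts (True = attack judged successful); NOT the report's data.
heuristic = [True] * 41 + [False] * 9   # heuristic grader: 41/50 successes
second_grader = [False] * 50            # stand-in for FLIP: 0/50 successes

kappa = cohen_kappa(heuristic, second_grader)
gap_pp = (asr(heuristic) - asr(second_grader)) * 100

print(f"kappa = {kappa:.2f}, over-report = {gap_pp:.0f}pp")
```

In this toy case κ evaluates to zero because the second grader never varies: the only agreement is on the items the heuristic also marks as failures, which is exactly what chance predicts from the marginals.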

2. Revised Predictions — Per-Prediction Deltas vs Wave-2

Frontier-lab deployment cadence (§1 wave-2): unchanged.

Bounty-ecosystem emergence (§2 wave-2): materially revised — see §5. The three wave-2 structural features (scoped winning condition, tiered reward, NDA envelope) were correct but incomplete; grader-methodology rigor is now a fourth feature a credible program must disclose.

Regulatory milestone tracking (§3 wave-2): dates unchanged. Added leverage: Report #363 provides a concrete methodology ask — “κ disclosure required on any ASR cited for conformity assessment” — directly quotable into EU AI Office, NIST AISIC, and Safe Work Australia submissions.

F41LUR3-F1R57 positioning calls (§4 wave-2): §4.1 (EU AI Act public comment by 1 August 2026) is strengthened, not replaced. Wave-2 anchored solely on the PARTIAL-dominance finding (text-hedge-but-execute on VLA). It is now dual-anchored: (a) PARTIAL dominance at the text↔action interface and (b) heuristic↔FLIP κ=0 at the grading-instrumentation layer. The two arguments are independent, since one concerns output semantics and the other verdict extraction, and they jointly close the “false assurance” argument. §4.2 (NIST AISIC comment) gains a directly quotable calibration exemplar; §4.3 (Bounty-Ecosystem Compatibility Report) gains an additional compatibility axis.

Top 5 forecast-breaking events (§5 wave-2): §5.5 unchanged. §5.1 remains 35–50% but grading-rigor is now a prerequisite the published procedure must answer. New events added in §4 below.

3. Updated Top F41LUR3-F1R57 Move

Action: Submit EU AI Act high-risk conformity-assessment public comment by 1 August 2026 (unchanged target and falsifiable outcome).

What changed: the argument structure. Wave-2 anchored on a single claim: “text-level conformity assessment produces systematic false assurance because 50% of VLA FLIP verdicts are PARTIAL.” The addendum adds an independent second claim: “grading instrumentation used by most public red-team programs can over-report ASR by 0–82pp with κ=0 at the extreme; conformity assessment that does not specify grader methodology and disclose inter-grader κ is not a bounded-error measurement.” The submission should request that notified-body assessment procedures under Article 9 do three things: (a) specify the grading method, (b) disclose Cohen’s κ between two independent graders on a stratified subsample, and (c) report per-model calibration rather than relying on a corpus-wide overcount multiplier. This is concrete and falsifiable (notified bodies adopt it or they do not) and moves F41LUR3-F1R57 from “pointing at a gap” to “specifying a measurement protocol.”
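
Request (c) can be illustrated with a small arithmetic sketch. It assumes two hypothetical models whose heuristic over-report sits at opposite ends of the 0–82pp range, and shows what happens when a single corpus-wide multiplier is used to correct both. All counts and model names are invented for illustration.

```python
# Synthetic per-model counts; model names and numbers are hypothetical.
# Each entry: (prompts graded, heuristic successes, reference-grader successes)
counts = {
    "model_a": (50, 41, 0),    # heuristic over-reports by 82pp
    "model_b": (50, 20, 20),   # heuristic agrees with the reference grader
}

total_heuristic = sum(h for _, h, _ in counts.values())
total_reference = sum(r for _, _, r in counts.values())

# One corpus-wide correction factor, applied to every model's heuristic ASR.
corpus_multiplier = total_reference / total_heuristic


def residual_error_pp(n, heuristic_hits, reference_hits, multiplier):
    """Error (in pp) left after scaling heuristic ASR by a shared multiplier."""
    corrected_asr = (heuristic_hits / n) * multiplier
    reference_asr = reference_hits / n
    return (corrected_asr - reference_asr) * 100


residuals = {
    model: residual_error_pp(n, h, r, corpus_multiplier)
    for model, (n, h, r) in counts.items()
}

for model, err in residuals.items():
    print(f"{model}: residual error after corpus-wide correction = {err:+.1f}pp")
```

With these inputs the shared multiplier is exact in aggregate but leaves each model off by roughly 27pp in opposite directions, which is the failure mode per-model calibration is meant to avoid.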

Confidence this is still the right top move: high (>80%). The deadline is fixed by legislation; the evidence package is stronger than it was 24 hours ago; and a second independent argument materially reduces the risk that the submission reads as a one-finding complaint.

4. Revised “What Would Change Our View” Event List

This addendum adds three events to the wave-2 list. Probabilities are my estimates for occurrence within the window (through 31 October 2026).

E6 — Heuristic↔FLIP divergence replicated at single-cohort scale (≥4 frontier models, single pack). Probability: high (70–85%). Amy’s cascade is this replication in progress. If three of four remaining models show FLIP ASR ≤5% against heuristic ASR ≥50%, we have a cohort finding directly quotable against calibration claims in existing lab system cards. If observed, re-forecast §2 and §4.1 within 48 hours.
E7 — External party documents heuristic↔FLIP divergence on their own data. Probability: plausible (30–50%). If a lab, bounty program, or academic group publishes κ between heuristic and LLM grading on any public benchmark in window, F41LUR3-F1R57 moves from “sole public methodology making this claim” to “one of several.” Positioning-positive (validates direction); commercially neutral-to-positive (B1/B2 value proposition hardens as the standard crystallises).
E8 — A bounty program announces κ-disclosure as a submission requirement. Probability: plausible (25–45%). If OpenAI, Anthropic, or a consortium bounty requires submitters to disclose heuristic↔LLM κ on their own findings, the credibility bar rises to where F41LUR3-F1R57 already operates. Anthropic is the most likely first mover. If observed, §4.3 becomes higher-priority.
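
The E6 trigger condition is mechanical enough to write down as a check. The sketch below encodes the thresholds stated in E6 (FLIP ASR ≤5%, heuristic ASR ≥50%, at least three of the four remaining models). The function names and the example values are placeholders and do not reflect any cascade results.

```python
FLIP_ASR_MAX = 0.05       # "FLIP ASR <=5%" from E6
HEURISTIC_ASR_MIN = 0.50  # "heuristic ASR >=50%" from E6
MODELS_REQUIRED = 3       # "three of four remaining models"


def shows_divergence(heuristic_asr, flip_asr):
    """True when one model meets both E6 thresholds."""
    return flip_asr <= FLIP_ASR_MAX and heuristic_asr >= HEURISTIC_ASR_MIN


def e6_triggered(results):
    """results maps model name -> (heuristic ASR, FLIP ASR) as fractions.

    Returns (triggered, list of models showing the divergence).
    """
    diverging = [
        model for model, (heuristic_asr, flip_asr) in results.items()
        if shows_divergence(heuristic_asr, flip_asr)
    ]
    return len(diverging) >= MODELS_REQUIRED, diverging


# Placeholder values for illustration only; no cascade results are implied.
example = {
    "model_1": (0.70, 0.02),
    "model_2": (0.55, 0.05),
    "model_3": (0.60, 0.20),   # FLIP ASR too high to count
    "model_4": (0.90, 0.00),
}

triggered, diverging = e6_triggered(example)
print(triggered, diverging)
```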

Wave-2’s E5.1–E5.5 are unchanged.

5. Bounty-Ecosystem Update — “Methodology-Credibility” as a Program Differentiator

Wave-2 differentiated bounty programs along three axes: (1) winning-condition scope, (2) reward structure, (3) disclosure envelope. Report #363 adds a fourth: methodology-credibility. A program publishing heuristic-only ASR, or not disclosing the grader used to adjudicate submissions, is exposed to the same 82pp artefact risk — and that risk is now publicly demonstrable.

Revised six-month forecast:

Lab-run bounties (OpenAI, Anthropic, DeepMind if entering): likely (60–80%) to quietly adopt LLM-graded adjudication by end of window. Public disclosure of the grader is less likely than internal use. Anthropic most likely to disclose.
Consortium and insurance-backed programs: plausible (40–60%) to require κ disclosure as submission hygiene — stronger incentives than individual labs to avoid being caught over-reporting. B2 commercial brief gains leverage.
Grey-market / non-lab bounties: likely (60–80%) to remain heuristic-only for cost reasons. This creates a two-tier market where methodology-credibility is a visible differentiator.
F41LUR3-F1R57 positioning: §4.3 (Bounty-Ecosystem Compatibility Report) should score each program on the four axes, with methodology-credibility weighted equally to disclosure envelope. Labs and insurers gain incentive to reference a neutral scoring framework.
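
A four-axis score of the kind proposed for §4.3 could be sketched as follows. The 0-to-1 scale, the axis identifiers, and the choice to weight all four axes equally are assumptions made for this example; the text above only requires that methodology-credibility and disclosure envelope carry the same weight.

```python
AXES = (
    "winning_condition_scope",
    "reward_structure",
    "disclosure_envelope",
    "methodology_credibility",
)

# Assumption: all four axes weighted equally. The report only states that
# methodology_credibility is weighted equally to disclosure_envelope.
WEIGHTS = {axis: 1 / len(AXES) for axis in AXES}


def program_score(axis_scores):
    """Weighted score in [0, 1] from per-axis scores, each in [0, 1]."""
    missing = [axis for axis in AXES if axis not in axis_scores]
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    for axis in AXES:
        if not 0.0 <= axis_scores[axis] <= 1.0:
            raise ValueError(f"{axis} score must be between 0 and 1")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in AXES)


# Hypothetical program: well scoped and tiered, partial disclosure,
# no published grader methodology.
hypothetical_program = {
    "winning_condition_scope": 1.0,
    "reward_structure": 1.0,
    "disclosure_envelope": 0.5,
    "methodology_credibility": 0.0,
}

score = program_score(hypothetical_program)
print(f"score = {score:.3f}")
```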

Falsifiable marker: if by 31 October 2026 no public bounty program has disclosed a κ figure for its adjudication pipeline, the “methodology-credibility emerges as a differentiator in this window” prediction is wrong and re-scored as a 12-month horizon.

6. Next Review and Scorecard Integration

Next review: 2026-05-15 or on completion of Amy’s cohort cascade, whichever is sooner. If E6 lands at ≥3/4 frontier models, this addendum is superseded by a cohort-scale brief.
Scorecard integration: E6, E7, E8 to be entered into data/governance/PREDICTION_SCORECARD.md as P-numbered predictions at the next coord pass.
EU AI Act submission: dual-anchor argument to be integrated into research/policy/eu_ai_act_conformity_comment_2026-Q2.md before the 1 August 2026 submission deadline.

Word count: approximately 1,110. Addendum only — wave-2 Report #361 remains in force for all predictions not explicitly revised here.