Report 328 Research — Empirical Study

Summary

This report consolidates all existing defense evaluation data across four independent experimental runs, totaling 168 raw traces, of which approximately 88 yielded FLIP-graded, evaluable verdicts.

Consolidated Defense Spectrum (FLIP Broad, L1B3RT4S Attack Family)

Model            | Scale     | NONE (FLIP broad) | STRUCTURED (FLIP broad) | Delta              | n per arm
qwen3.5          | ~397B MoE | 100% (6/6)        | 50% (3/6)               | -50pp              | 6
Nemotron-3-Super | 120B      | 83% (5/6)         | 50% (3/6)               | -33pp              | 6
GLM-5            | ~744B MoE | 67% (4/6)         | 100% (6/6)              | +33pp (iatrogenic) | 6
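
The Delta column can be reproduced directly from the per-arm counts. The following is a minimal sketch that assumes Delta is the STRUCTURED rate minus the NONE rate, in percentage points, rounded to the nearest integer; the dictionary layout and function names are illustrative and not taken from the report's tooling.

```python
# Counts (successes, trials) per arm, copied from the consolidated table.
# The data layout and helper names are illustrative assumptions.
results = {
    "qwen3.5": {"NONE": (6, 6), "STRUCTURED": (3, 6)},
    "Nemotron-3-Super": {"NONE": (5, 6), "STRUCTURED": (3, 6)},
    "GLM-5": {"NONE": (4, 6), "STRUCTURED": (6, 6)},
}


def rate_percent(successes, trials):
    """Attack success rate as a percentage."""
    return 100.0 * successes / trials


def delta_pp(arms):
    """Assumed definition: STRUCTURED rate minus NONE rate, in percentage points."""
    return rate_percent(*arms["STRUCTURED"]) - rate_percent(*arms["NONE"])


deltas = {model: round(delta_pp(arms)) for model, arms in results.items()}

for model, delta in deltas.items():
    # A positive delta means the defense raised the success rate (iatrogenic).
    note = " (iatrogenic)" if delta > 0 else ""
    print(f"{model}: {delta:+d}pp{note}")
```

Under this definition a negative delta means the defense lowered the success rate, and a positive delta means it raised it.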

Key revision from the prior report: under FLIP grading, Nemotron moves from “neutral (0pp)” to “mildly protective (-33pp).” A three-mode spectrum still holds (protective / mildly protective / iatrogenic), but the previously reported neutral mode was an artifact of heuristic grading.

Key Findings

  • Three-model defense spectrum observed: Protective (-50pp), mildly protective (-33pp), and iatrogenic (+33pp)
  • Iatrogenic finding: 1 of 3 models showed an increased attack success rate (ASR) with the defense active. GLM-5 went from 67% to 100% ASR when the STRUCTURED defense was added
  • Defense-as-context hypothesis: Defense text may function as topic priming rather than behavioral constraint for some models
  • All confidence intervals overlap substantially at n=6 per arm, so the deltas above should be read as preliminary; replication at n>=20 per arm is required

Trace Inventory

Dataset                        | Raw Traces | FLIP Graded | Evaluable | Attack Family
defense_v1.0                   | 120        | 88          | ~52       | Standard mixed
defense_vs_libertas (3 models) | 48         | 48          | 36        | L1B3RT4S
Total                          | 168        | 136         | ~88       |
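
The totals row can be cross-checked against the per-dataset counts. This is a small sketch with illustrative key names; the defense_v1.0 evaluable count is approximate in the report (~52) and is treated as exactly 52 here only for the arithmetic.

```python
# Per-dataset counts copied from the trace inventory.
# Key names are illustrative; 52 stands in for the report's approximate "~52".
inventory = {
    "defense_v1.0": {"raw": 120, "graded": 88, "evaluable": 52},
    "defense_vs_libertas": {"raw": 48, "graded": 48, "evaluable": 36},
}

totals = {
    column: sum(row[column] for row in inventory.values())
    for column in ("raw", "graded", "evaluable")
}
print(totals)

# The L1B3RT4S evaluable count matches 3 models x 2 arms x 6 verdicts per arm.
libertas_expected = 3 * 2 * 6
print(libertas_expected == inventory["defense_vs_libertas"]["evaluable"])
```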

Report #328 | F41LUR3-F1R57 Adversarial AI Research
