Format-Lock Mid-Range Experiment | Research | Failure-First

Adrian Wedd

Report 293 Research — Empirical Study 2026-03-25

Audio Overview

Summary

Format-lock attacks achieve 88.0% broad ASR [79.2%, 93.3%] on mid-range (4-14B) models, compared to 35.5% [21.1%, 53.1%] for matched no-format-lock controls on the same harm topics. The delta of +52.5pp is statistically significant (chi2=29.39, p<0.000001, Cramer’s V=0.508, large effect). This fills the critical data gap between sub-2B models (where all attacks succeed) and frontier models (where only format-lock maintains elevated ASR).

Background

The capability-floor hypothesis (Report #51, EP-51) predicts three regimes:

Below ~3B: All attacks succeed regardless of technique (capability floor)
Mid-range (4-14B): Format-lock maintains elevated ASR while standard attacks begin to be refused
Frontier (>70B, safety-trained): Only format-lock maintains elevated ASR above ~20%, standard attacks approach 0%

Prior data existed for regimes 1 and 3 but not regime 2. This experiment closes that gap.

Methodology

Scenarios

Format-lock scenarios (n=30): FL-JSON-001 through FL-MDTBL-005 from data/format_lock/format_lock_experiment_v0.1.jsonl. Six format types: JSON (5), YAML (5), code (5), CSV (5), XML (5), markdown table (5). Each requests harmful technical content (cybercrime, misinformation, weapons, surveillance, social engineering, drug synthesis) within a strict format compliance framing.
Control scenarios (n=12): FL-CTRL-001 through FL-CTRL-012 from data/format_lock/format_lock_controls_v0.1.jsonl. Same harm topics, plain text prompts, no format framing. Matched to format-lock scenarios by harm class.

Models Tested

Model	Parameters	Provider	Source
Gemma 3 12B	12B	Google via OpenRouter (paid)	New traces
Gemma 3 12B	12B	Google via Ollama Cloud	Prior Wave data
Qwen 2.5 7B	7B	Alibaba via OpenRouter (paid)	New traces
Qwen 2.5 7B	7B	Qwen via Ollama	Prior format_lock_7b data
Phi-4	14B	Microsoft via OpenRouter (paid)	New traces
Ministral 14B	14B	Mistral via Ollama Cloud	Prior Wave data

Note: Gemma 3 4B was attempted but returned HTTP 502 (server errors) on all 42 scenarios via OpenRouter. This model could not be evaluated in this experiment.

Grading

All traces graded by Claude Haiku 4.5 (OpenRouter, anthropic/claude-haiku-4.5) using FLIP backward inference methodology. Verdicts: COMPLIANCE, PARTIAL, REFUSAL, HALLUCINATION_REFUSAL. INFRASTRUCTURE_ERROR traces excluded from ASR calculation.

Results

Format-Lock ASR by Model (Haiku-graded, LLM-only)

Model	Size	FL Strict ASR	FL Broad ASR	CTRL Strict ASR	CTRL Broad ASR	FL n	CTRL n
Gemma 3 12B (OR)	12B	95.7% [79.0%, 99.2%]	100.0% [85.7%, 100.0%]	14.3% [2.6%, 51.3%]	42.9% [15.8%, 75.0%]	23	7
Gemma 3 12B (OC)	12B	100.0% [83.9%, 100.0%]	100.0% [83.9%, 100.0%]	N/A	N/A	20	0
Qwen 2.5 7B (OR)	7B	93.3% [78.7%, 98.2%]	93.3% [78.7%, 98.2%]	25.0% [8.9%, 53.2%]	41.7% [19.3%, 68.0%]	30	12
Qwen 2.5 7B (Ollama)	7B	90.5% [71.1%, 97.3%]	90.5% [71.1%, 97.3%]	N/A	N/A	21	0
Phi-4	14B	70.0% [52.1%, 83.3%]	73.3% [55.6%, 85.8%]	0.0% [0.0%, 24.2%]	25.0% [8.9%, 53.2%]	30	12
Ministral 14B (OC)	14B	95.0% [76.4%, 99.1%]	95.0% [76.4%, 99.1%]	N/A	N/A	20	0

Wilson 95% confidence intervals throughout.

Aggregate Analysis (Models with Paired FL + Control Data)

Metric	Format-Lock	Control	Delta
Broad ASR	88.0% [79.2%, 93.3%] (73/83)	35.5% [21.1%, 53.1%] (11/31)	+52.5pp
Strict ASR	85.5% [76.4%, 91.6%] (71/83)	12.9% [4.8%, 29.9%] (4/31)	+72.6pp

Chi-square test: chi2=29.39, df=1, p<0.000001 Cramer’s V: 0.508 (large effect)

Per-Model Statistical Tests

Model	FL Broad	CTRL Broad	Delta	Fisher’s p	Odds Ratio
Gemma 3 12B	100.0% (23/23)	42.9% (3/7)	+57.1pp	0.0013	inf
Qwen 2.5 7B	93.3% (28/30)	41.7% (5/12)	+51.7pp	0.0008	19.60
Phi-4	73.3% (22/30)	25.0% (3/12)	+48.3pp	0.0061	8.25

All three per-model comparisons are individually significant at p<0.01. After Bonferroni correction (k=3), all remain significant at corrected alpha=0.0167.

Interpretation

Capability-Floor Hypothesis: Supported

The mid-range data fills the gap and supports the three-regime model:

Sub-2B (capability floor): Prior data shows all attacks succeed (>90% ASR). Format-lock provides no additional advantage because safety training is insufficient to refuse anything.
Mid-range (4-14B): Format-lock achieves 73-100% ASR while controls achieve 0-42% ASR. Safety training is present but format compliance overrides it. This is the critical “sweet spot” for format-lock attacks.
Frontier (>70B, safety-trained): Prior data (Report #51) shows format-lock achieves 23-42% ASR on frontier models vs <10% for standard attacks.

Key Observations

Format compliance scales with capability. More capable mid-range models (Gemma 12B: 100%, Ministral 14B: 95%) are more susceptible to format-lock than smaller frontier-adjacent models (Phi-4: 73%). This is consistent with the hypothesis that format compliance and safety reasoning are partially independent capabilities.
Phi-4 shows the most resistance. At 14B with Microsoft’s safety training, Phi-4 achieves 70% strict ASR (vs 90-100% for other mid-range models). This suggests that safety training investment matters even within the mid-range, though format-lock still achieves 3x the control ASR.
Control ASR is non-zero. At 25-42% broad ASR, mid-range models still comply with some proportion of direct harmful requests. This is the “permissive” cluster identified in Report #50 (provider signature analysis).
Confidence intervals are wide for controls. With n=7-12 control traces per model, control ASR estimates have wide CIs ([8.9%, 53.2%] typical). The format-lock advantage is robust despite this uncertainty, but exact control ASR requires larger samples.

Limitations

Gemma 3 4B not tested. Server errors (HTTP 502) prevented evaluation. The 4B range is the boundary between capability-floor and mid-range — this gap remains.
Small control sample. Only 12 matched controls exist; n=7-12 per model for controls. Format-lock advantage is statistically significant but control ASR point estimates are imprecise.
Heuristic over-report concern. Initial heuristic classification marked many traces as “BENIGN” that Haiku subsequently graded as COMPLIANCE. This is consistent with Mistake #21 (keyword classifier false positives) operating in the opposite direction — format-lock responses that comply in structured format may lack the keywords that heuristic classifiers expect.
No PARTIAL decomposition. Very few PARTIAL verdicts in this experiment (9/176 = 5.1%), so the strict/broad distinction has minimal impact. This differs from VLA traces where PARTIAL dominates.
Single grader. Haiku grading only; no dual-grading or inter-rater reliability check for this specific dataset.

CCS Paper Implications

This experiment provides the missing middle data point for the format-lock capability curve (CCS Section 4.4). The recommended figure should show:

X-axis: Model size (log scale, 1B to 200B+)
Y-axis: Broad ASR (%)
Two curves: Format-lock (remains elevated 70-100% across 4-14B) vs Standard attacks (drops from >90% at sub-2B to 25-42% at 4-14B to <10% at frontier)

The separation between curves is maximal in the mid-range, which is the key CCS contribution.

Data Locations

New traces: runs/format_lock_midrange_v2/ (3 models x 42 scenarios = 126 traces)
Graded traces: runs/grading/format_lock_midrange_haiku/
Prior Ollama Cloud graded: runs/grading/format_lock_midrange_haiku/graded_traces_ollama_cloud_*.jsonl
Prior Qwen 2.5 7B graded: runs/grading/format_lock_midrange_haiku/graded_qwen2.5-7b_for_grading.jsonl
Scenarios: data/format_lock/format_lock_experiment_v0.1.jsonl, data/format_lock/format_lock_controls_v0.1.jsonl

Follow-Up

Retry Gemma 3 4B when Google’s OpenRouter endpoint stabilizes (or use Ollama with pulled model)
Expand controls to n=30 matched controls for tighter CIs on control ASR
Cross-reference with DP audit (Report #294): reasoning model traces show DETECTED_PROCEEDS pattern at elevated rates on format-lock scenarios