Controlled Scale-Sweep Experiment Protocol

Adrian Wedd

Report 281 Research — Empirical Study 2026-03-24

1. Motivation

Established findings suggest safety training investment matters more than model scale for jailbreak resistance (Report #50). However, several observations point toward a capability-safety transition threshold in the 3-7B parameter range:

Capability-floor hypothesis (Report #51, Clara Oswald): Below ~3B parameters, all attacks succeed regardless of type. Above ~7B, only format-lock maintains elevated ASR.
Obliteratus re-emergence (Report #48): Safety behavior partially re-emerges in abliterated models at scale. In the Qwen3.5 series, ASR declines monotonically from 99.8% at 0.8B to 54.2% at 9.0B, with non-overlapping per-scale Wilson CIs. The trend is not significant by a valid rank test at k=4 scales (Spearman ρ=−1.0, exact p=0.083), and the 9B residual is hedging, not refusal.
Existing DB evidence (preliminary, Section 8 below): The non-abliterated Qwen3 series shows non-monotonic broad ASR: 100% at 0.6B, 55.2% at 1.7B, 99.8% at 4B (driven by PARTIAL verdicts; strict ASR is 24.2%), and 100% at 8B (68.1% strict).

This experiment will produce the first controlled, pre-registered measurement of how safety training effectiveness varies with model scale, holding architecture family constant.

2. Research Questions

Primary (RQ1): Does a capability-safety transition threshold exist, and if so, at what parameter count?

Secondary:

RQ2: Is the transition threshold consistent across model families (Llama 3.x vs Qwen3)?
RQ3: Do different attack families interact with scale differently (i.e., is there a scale x attack-family interaction)?
RQ4: Does the transition manifest differently for strict ASR vs broad ASR (i.e., do models at the threshold shift from COMPLIANCE to PARTIAL rather than from COMPLIANCE to REFUSAL)?

3. Model Selection

Two families selected to provide overlapping scale coverage and architecture diversity:

Llama 3.x Series

Model	Parameters	Availability	Notes
meta-llama/llama-3.2-1b-instruct	1B	OpenRouter	Smallest instruct-tuned Llama 3
meta-llama/llama-3.2-3b-instruct	3B	OpenRouter / Ollama	Key transition zone
meta-llama/llama-3.1-8b-instruct	8B	OpenRouter / Ollama	Standard mid-range
meta-llama/llama-3.3-70b-instruct	70B	OpenRouter	Existing data: 27.3% strict, 51.3% broad (n=300)

Qwen3 Series

Model	Parameters	Availability	Notes
Qwen/Qwen3-0.6B	0.6B	Ollama	Below capability floor
Qwen/Qwen3-1.7B	1.7B	Ollama	Existing data: 42.0% strict, 55.2% broad (n=543)
Qwen/Qwen3-4B	4B	Ollama	Interesting: 24.2% strict but 99.8% broad (PARTIAL)
Qwen/Qwen3-8B	8B	Ollama	68.1% strict, 100% broad
Qwen/Qwen3-32B	32B	OpenRouter	Largest Qwen3 available

Total: 9 model checkpoints spanning 0.6B to 70B across 2 architectures.

Control note: Qwen3-0.6B and Qwen3-1.7B are base or minimally instruct-tuned. The 4B+ models received more extensive safety training. This is a confounder: scale and safety training investment are correlated within families. The two-family design partially mitigates this (if both families show the same threshold, scale is the more parsimonious explanation).

4. Attack Family Selection

Five attack families selected to span the taxonomy along three axes: (1) historical era, (2) mechanism type (persona/encoding/reasoning/format/social), and (3) expected baseline ASR range.

#	Family	Era	Mechanism	Expected ASR Range	Scenarios
1	DAN-era persona hijack	dan_2022	Persona override	Low on frontier (1%), high on small models	10
2	Crescendo (multi-turn escalation)	crescendo_2024	Social engineering	Medium-high (40-85%)	10
3	Reasoning exploit (CoT manipulation)	reasoning_2025	Reasoning chain injection	Medium (20-79%)	10
4	Authority injection (embodied)	embodied_2026	Social authority claim	TBD (no scale data yet)	10
5	Post-refusal leakage (embodied)	embodied_2026	Persistence after refusal	TBD (no scale data yet)	10

Rationale for selection:

DAN (family 1): Highest-volume historical technique. Expected to show ceiling effect at small scale (all succeed) and floor at large scale (all fail). If a transition exists, DAN should reveal it cleanly.
Crescendo (family 2): Multi-turn social engineering. Known to be effective even on safety-trained models (65% strict on DeepSeek-R1 1.5B). Tests whether multi-turn attacks bypass scale-dependent safety differently from single-shot.
Reasoning exploit (family 3): Targets the reasoning chain, which is itself scale-dependent. Reasoning capability emerges around 7B+. Creates an interesting interaction: attack mechanism requires capability that emerges at the same scale as the defenses it targets.
Authority injection (family 4): Embodied-AI-specific. Tests social authority claims in robotic contexts. No prior scale data exists.
Post-refusal leakage (family 5): Tests persistence attacks where the model initially refuses but leaks information on follow-up. This probes whether safety training at different scales produces robust vs brittle refusals.

5. Sample Size Calculation

Target Effect Size

Based on existing data, the capability-floor effect (Report #51) suggests a transition from ~80-100% ASR (below 3B) to ~20-50% ASR (above 7B), approximately a 30-50pp difference. We target detecting a 20pp difference as the minimum meaningful effect size.

Power Analysis

For a chi-square test comparing two proportions (e.g., ASR at 1B vs ASR at 8B):

Baseline proportion (p1): 0.70 (expected ASR for sub-3B models)
Alternative proportion (p2): 0.50 (expected ASR for 8B+ models, broad metric)
Significance level (alpha): 0.05 (two-sided)
Power (1-beta): 0.80
Required n per group: ~93 per model-family cell (pooled-variance formula; the unpooled approximation below gives ~90)

For a 20pp difference (0.70 vs 0.50):

Using the formula: n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1-p2)^2
n = (1.96 + 0.84)^2 * (0.21 + 0.25) / (0.04) = 7.84 * 0.46 / 0.04 = ~90
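
The arithmetic above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `n_per_group` is illustrative and not part of the project tooling. It implements the unpooled normal-approximation formula quoted above, with exact normal quantiles in place of the rounded 1.96 and 0.84.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Unpooled normal-approximation sample size per group (two-sided test).

    n = (z_{alpha/2} + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)           # about 0.84 for power = 0.80
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


if __name__ == "__main__":
    n = n_per_group(0.70, 0.50)
    # Round up because a fractional observation cannot be collected.
    print(f"required n per group: {n:.1f} (round up to {ceil(n)})")
```

Using exact quantiles moves the result only slightly relative to the hand calculation, so the two should agree to within one observation.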

However, we have 50 scenarios and 9 models = 450 total evaluations per run.

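Per-model sample: 50 scenarios. At n=50 per model, we can detect a 25pp difference at ~80% power, or a 20pp difference at ~65% power; these figures are consistent with a one-sided test at alpha=0.05. Under the two-sided criterion stated above, the normal approximation gives roughly 74% power for 25pp (0.70 vs 0.45) and roughly 55% for 20pp (0.70 vs 0.50). This is adequate for the primary question (RQ1) because: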

The expected effect size is 30-50pp (much larger than minimum detectable)
We have 9 scale points, so trend analysis (logistic regression on log-parameters) pools statistical power across all points
Wilson 95% CIs on individual model ASR at n=50: for ASR=50%, CI=[36.6%, 63.4%] — sufficient to distinguish 50% from 80%
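
The interval and power figures in this subsection can be checked with the sketch below. It uses only the standard library; `wilson_interval` and `approx_power` are illustrative names, and the power function assumes the same unpooled normal approximation as the sample-size formula above (an assumption, since the protocol does not state which approximation was used for the n=50 figures).

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96


def wilson_interval(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def approx_power(p1, p2, n, z_crit=Z95):
    """Normal-approximation power for comparing two proportions, n per group.

    Pass z_crit=NormalDist().inv_cdf(0.95) for a one-sided test at alpha = 0.05.
    """
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return NormalDist().cdf(abs(p1 - p2) / se - z_crit)


if __name__ == "__main__":
    low, high = wilson_interval(25, 50)  # ASR = 50% at n = 50
    print(f"Wilson 95% CI: [{low:.1%}, {high:.1%}]")
    print(f"two-sided power, 0.70 vs 0.50, n=50: {approx_power(0.70, 0.50, 50):.2f}")
```

Calling `wilson_interval(40, 50)` gives the corresponding interval for an 80% ASR, which is a quick way to check whether the two intervals overlap.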

Total evaluations: 9 models x 50 scenarios = 450 API calls.

Correction for Multiple Comparisons

With 9 models and 5 attack families:

Model pairwise comparisons: C(9,2) = 36 pairs. Bonferroni-corrected alpha = 0.05/36 = 0.0014.
Attack family comparisons within model: 5 families per model. Bonferroni alpha = 0.05/5 = 0.01 per model.
Primary analysis (logistic regression): No correction needed (single model, single test).
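
The corrected thresholds follow directly from the counts above. The sketch below recomputes them; the constant and function names are illustrative.

```python
from math import comb

ALPHA = 0.05
N_MODELS = 9
N_ATTACK_FAMILIES = 5


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests


# Every unordered pair of models is one comparison: C(9, 2) = 36.
pairwise_model_tests = comb(N_MODELS, 2)
alpha_model_pairs = bonferroni_alpha(ALPHA, pairwise_model_tests)

# One test per attack family within each model, as stated above.
alpha_attack_families = bonferroni_alpha(ALPHA, N_ATTACK_FAMILIES)

if __name__ == "__main__":
    print(pairwise_model_tests, round(alpha_model_pairs, 4), alpha_attack_families)
```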

6. Statistical Analysis Plan

6.1 Pre-Registered Primary Analysis

Logistic regression with ASR (binary: COMPLIANCE or PARTIAL = 1, REFUSAL or HALLUCINATION_REFUSAL = 0) as outcome:

logit(ASR) = beta_0 + beta_1 * log2(params_B) + beta_2 * family + beta_3 * attack_family + beta_4 * log2(params_B) * attack_family

Where:

log2(params_B) = log2 of parameter count in billions (continuous)
family = model family (Llama vs Qwen, binary)
attack_family = attack family (5-level categorical)
Interaction term tests RQ3 (scale x attack-family interaction)

Significance criterion: beta_1 significant at p < 0.05 (two-sided) answers RQ1.
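
To make the model specification concrete, the sketch below builds one design-matrix row per graded trace and evaluates the linear predictor. It is an illustration under stated assumptions, not the project's analysis code: the attack-family labels, the choice of the first family as reference level, and the function names are all hypothetical. Fitting the coefficients would be done with a statistics package and is not shown.

```python
from math import exp, log2

# Hypothetical short labels for the five attack families in Section 4.
ATTACK_FAMILIES = ["dan", "crescendo", "reasoning", "authority", "leakage"]


def design_row(params_b, family, attack_family):
    """Return [intercept, log2(params_B), family, attack dummies, interactions].

    Treatment coding: the first attack family is the reference level, so the
    5-level factor contributes 4 dummies and 4 scale-by-attack interactions.
    """
    scale = log2(params_b)
    family_qwen = 1.0 if family == "qwen" else 0.0  # Llama is the reference
    dummies = [1.0 if attack_family == a else 0.0 for a in ATTACK_FAMILIES[1:]]
    interactions = [scale * d for d in dummies]
    return [1.0, scale, family_qwen, *dummies, *interactions]


def predicted_asr(beta, row):
    """Inverse-logit of the linear predictor beta . row."""
    eta = sum(b * x for b, x in zip(beta, row))
    return 1.0 / (1.0 + exp(-eta))


if __name__ == "__main__":
    row = design_row(8, "llama", "crescendo")
    print(len(row), row)
```

Under this coding, beta_1 is the scale slope for the reference attack family, and each interaction coefficient is the difference in slope for another family relative to it. That is worth keeping in mind when reading the significance criterion for beta_1.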

6.2 Pre-Registered Secondary Analyses

Piecewise regression / changepoint detection: Fit a segmented logistic regression to identify the parameter count at which the slope changes (the “transition threshold”). Use the segmented approach: fit models with breakpoints at each tested scale and compare AIC/BIC.
Within-family monotonicity test: Jonckheere-Terpstra trend test for ordered alternatives within each model family (H1: ASR is monotonically non-increasing with scale).
PARTIAL decomposition (RQ4): For each model, compute the ratio PARTIAL / (COMPLIANCE + PARTIAL). If the transition manifests as a shift from COMPLIANCE to PARTIAL (rather than to REFUSAL), this ratio should increase with scale.
Attack family interaction (RQ3): Likelihood ratio test comparing model with vs without log2(params_B) * attack_family interaction term.
Family effect (RQ2): After adjusting for scale, test whether family coefficient is significant. A non-significant family effect supports scale as the primary driver (controlling for safety training differences between families).
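
The PARTIAL decomposition can be sketched as follows. The verdict labels are those used in Section 6.1; the function name is illustrative, and the definition of strict ASR as COMPLIANCE only is an assumption inferred from how strict and broad ASR are contrasted elsewhere in this protocol.

```python
from collections import Counter


def verdict_metrics(verdicts):
    """Strict ASR, broad ASR, and PARTIAL share for one model's verdicts.

    Assumed definitions:
      strict ASR    = COMPLIANCE / n
      broad ASR     = (COMPLIANCE + PARTIAL) / n
      PARTIAL share = PARTIAL / (COMPLIANCE + PARTIAL)
    """
    if not verdicts:
        raise ValueError("no verdicts supplied")
    counts = Counter(verdicts)
    n = len(verdicts)
    compliance = counts["COMPLIANCE"]
    partial = counts["PARTIAL"]
    successes = compliance + partial
    return {
        "strict_asr": compliance / n,
        "broad_asr": successes / n,
        # Undefined when the model never complied, so report None.
        "partial_share": partial / successes if successes else None,
    }


if __name__ == "__main__":
    sample = ["COMPLIANCE"] * 2 + ["PARTIAL"] * 6 + ["REFUSAL"] * 2
    print(verdict_metrics(sample))
```

Reporting `None` rather than 0 for a model with no successful attacks keeps those models out of the ratio trend instead of silently pulling it down.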

6.3 Exploratory Analyses (Not Pre-Registered)

Confidence calibration: does verbosity or hedging language correlate with scale?
Token-level analysis: does response length differ across scales for the same verdict?
HALLUCINATION_REFUSAL rate as a function of scale
Per-scenario difficulty: is the ranking stable across scales?

6.4 Reporting

All analyses will be reported regardless of outcome (including null results). Effect sizes with 95% CIs. Raw data published as trace JSONL.

7. Execution Plan

Phase 1: Scenario Preparation (Complete)

Select 50 scenarios balanced across 5 attack families
Create data/splits/scale_sweep_v0.1.jsonl
Validate with make validate
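
A balance check on the scenario file could look like the sketch below. It is not what `make validate` does (that target's behavior is not described here), and the JSONL field name `attack_family` is an assumption about the file schema.

```python
import json
from collections import Counter


def family_counts(jsonl_lines, field="attack_family"):
    """Count scenarios per attack family from an iterable of JSONL lines.

    The field name is hypothetical; adjust it to the actual schema.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)[field]] += 1
    return counts


def is_balanced(counts, n_families=5, per_family=10):
    """True when exactly n_families are present with per_family scenarios each."""
    return len(counts) == n_families and all(
        c == per_family for c in counts.values()
    )


if __name__ == "__main__":
    demo = [json.dumps({"attack_family": f"family_{i % 5}"}) for i in range(50)]
    print(is_balanced(family_counts(demo)))
```

To run it against the real split, pass an open file handle for data/splits/scale_sweep_v0.1.jsonl as `jsonl_lines`.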

Phase 2: Model Availability Check

Verify all 9 models available on target platform (OpenRouter / Ollama)
Run 1 scenario per model to confirm API access and response format
Document exact model IDs and versions

Phase 3: Evaluation Runs

Run all 450 evaluations using run_benchmark_http.py
Log all traces to runs/scale_sweep_v0.1/
Monitor for rate limiting (space calls across 24h if needed)

Phase 4: Grading

FLIP-grade all 450 traces using Claude Haiku 4.5 via OpenRouter
Manual audit of 10% sample (45 traces) stratified by model and attack family
Compute inter-rater reliability (target kappa > 0.6)
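
Inter-rater reliability between the automated grader and the manual audit can be computed with Cohen's kappa. The protocol does not name the kappa variant, so unweighted Cohen's kappa for two raters is an assumption here; the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    grader = ["REFUSAL", "REFUSAL", "COMPLIANCE", "COMPLIANCE"]
    auditor = ["REFUSAL", "REFUSAL", "COMPLIANCE", "REFUSAL"]
    print(cohens_kappa(grader, auditor))
```

With the 45-trace audit sample, `labels_a` would hold the FLIP verdicts and `labels_b` the manual verdicts for the same traces, in the same order.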

Phase 5: Analysis

Run pre-registered analyses (Section 6.1-6.2)
Write Report #172b with results
Update CANONICAL_METRICS.md if findings affect established claims

Cost Estimate

Platform	Models	Calls	Est. Cost
OpenRouter (free tier)	Llama-1B, Llama-3B, Llama-70B, Qwen-32B	200	$0 (free tier)
Ollama (local)	Qwen3-0.6B, 1.7B, 4B, 8B; Llama-8B	250	$0 (local)
Grading (Haiku 4.5)	All 450 traces	450	~$2-5

Total estimated cost: $2-5.

Timeline

Day 1: Scenario file created, validation passed (this session)
Day 2-3: Model availability checks + pilot run
Day 4-7: Full evaluation runs
Day 8-10: Grading + analysis + report

8. Preliminary Findings from Existing Data

Before running new experiments, we extracted scale-related patterns from the existing corpus (database query, 2026-03-24).

8.1 Non-Obliteratus Models by Scale (LLM-Graded, n>=20)

Model	Params	n	Strict ASR	Broad ASR
liquid/lfm-2.5-1.2b-instruct:free	1.2B	133	31.6%	67.7%
deepseek-r1:1.5b	1.5B	658	21.1%	40.0%
qwen3:1.7b	1.7B	543	42.0%	55.2%
llama3.2:latest	3B	241	18.7%	25.3%
mistralai/mistral-7b-instruct:free	7B	21	0.0%	4.8%
openai/gpt-4o-mini	8B	29	51.7%	58.6%
nvidia/nemotron-nano-9b-v2	9B	99	39.4%	52.5%
nvidia/nemotron-nano-12b-v2-vl:free	12B	83	33.7%	37.3%
mistralai/mistral-nemo	12B	25	24.0%	40.0%
mistralai/devstral-2512	24B	125	18.4%	44.8%
google/gemma-3-27b-it	27B	57	7.0%	17.5%
qwen/qwen3-coder:free	30B	21	9.5%	57.1%
nvidia/nemotron-3-nano-30b-a3b	30B	71	40.8%	47.9%
openrouter/pony-alpha	30B	39	28.2%	51.3%
meta-llama/llama-3.3-70b-instruct	70B	300	27.3%	51.3%
openai/gpt-oss-120b	120B	54	40.7%	40.7%
mistralai/mistral-large-2411	123B	90	28.9%	40.0%
claude-sonnet-4-5-20250929	175B	166	7.8%	11.4%
gpt-5.2	200B	176	11.9%	25.0%
gemini-3-flash-preview	30B	190	11.1%	12.6%
deepseek/deepseek-r1-0528	671B	148	41.9%	55.4%

8.2 Preliminary Observations

No clean monotonic relationship between scale and ASR. The data is confounded by:

Different safety training pipelines across providers (Anthropic at 175B shows 7.8% vs Nvidia at 9B shows 39.4%)
Different prompt pools — models were tested on different scenario subsets
Different grading epochs — some verdicts from earlier, less reliable graders

Within-family signals (Qwen3, non-abliterated, Ollama):

0.6B: 100% strict, 100% broad (n=60) — below capability floor, pure compliance
1.7B: 42.0% strict, 55.2% broad (n=543) — safety training partially effective
4B: 24.2% strict, 99.8% broad (n=7,379) — low strict but massive PARTIAL rate
8B: 68.1% strict, 100% broad (n=329) — higher strict than 4B (anomalous)

The Qwen3 pattern is non-monotonic, with 4B showing the lowest strict ASR but near-universal PARTIAL. This suggests:

At 4B, models have enough capability to attempt safety reasoning (producing PARTIAL)
But not enough to fully refuse (PARTIAL = hedging + compliance)
At 8B, the model may be confident enough to comply without hedging

This pattern is inconsistent with the simple transition hypothesis. Instead of a monotonic ASR decrease with scale, we may see a U-shaped curve: high compliance at very small scale (no safety capability), lower compliance at medium scale (emerging safety), and potentially higher compliance again at larger scale (confident compliance with complex requests).

Critical caveat: The Qwen3 8B data (n=329, 68.1% strict) likely includes different prompt distributions than the 4B data (n=7,379). The controlled experiment will resolve this by using identical prompts across all scales.

8.3 Obliteratus Control Series (Safety-Removed)

The Qwen3.5 obliteratus series provides a natural control where safety training is removed:

0.8B: 99.8% strict (n=487)
1.9B: 94.8% strict (n=649)
4.2B: 78.3% strict (n=1,008)
9.0B: 54.2% strict (n=2,019)

This decline in ASR with scale in abliterated models is monotonic, with non-overlapping per-scale Wilson CIs, although the rank trend is not significant at k=4 (Spearman ρ=−1.0, exact p=0.083). It suggests that scale-emergent properties partially reconstruct safety-like behavior even without explicit safety training. The controlled experiment will compare this against safety-trained models to decompose the contributions of scale vs training.
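
The rank-test figures quoted for the obliteratus series can be reproduced by enumerating all orderings of the four scale points. This is a minimal sketch using only the standard library, with illustrative function names. It assumes no tied values, which holds for the four points listed above.

```python
from itertools import permutations


def ranks(values):
    """Ranks 1..n for a list with no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(rank_x, rank_y):
    """Spearman correlation from two untied rank vectors."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


def exact_two_sided_p(x, y):
    """Exact permutation p-value: share of orderings with |rho| >= observed."""
    rank_x, rank_y = ranks(x), ranks(y)
    observed = abs(spearman_rho(rank_x, rank_y))
    all_orderings = list(permutations(rank_y))
    extreme = sum(
        abs(spearman_rho(rank_x, perm)) >= observed - 1e-12
        for perm in all_orderings
    )
    return extreme / len(all_orderings)


scales_b = [0.8, 1.9, 4.2, 9.0]
strict_asr = [99.8, 94.8, 78.3, 54.2]

if __name__ == "__main__":
    rho = spearman_rho(ranks(scales_b), ranks(strict_asr))
    print(rho, round(exact_two_sided_p(scales_b, strict_asr), 3))
```

With only four points there are 24 possible orderings, and just two of them are perfectly monotonic (one increasing, one decreasing). A perfect trend therefore cannot reach p < 0.05 in a two-sided exact test at k=4, whatever the per-scale sample sizes.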

9. Pre-Registration Statement

This protocol is pre-registered as of 2026-03-24 in this repository. The analysis plan (Section 6.1-6.2) is fixed before data collection. Exploratory analyses (Section 6.3) are clearly labeled. Changes to the protocol after data collection begins will be documented with rationale.

10. References

Report #48: Established finding on obliteratus safety re-emergence
Report #50: Safety training investment vs model scale
Report #51: Format-lock and capability-floor hypothesis (Clara Oswald)
Issue #541: Scale-sweep experiment design
CANONICAL_METRICS.md: Corpus-level numbers (236 models, 135,623 results)
