1. Motivation
Established findings suggest safety training investment matters more than model scale for jailbreak resistance (Report #50). However, several observations point toward a capability-safety transition threshold in the 3-7B parameter range:
- Capability-floor hypothesis (Report #51, Clara Oswald): Below ~3B parameters, all attacks succeed regardless of type. Above ~7B, only format-lock maintains elevated ASR.
- Obliteratus re-emergence (Report #48): Safety behavior partially re-emerges in abliterated models at scale (Qwen3.5 series: 100% ASR at 0.8B, declining to 47.3% at 9.0B, Spearman rho=-0.949).
- Existing DB evidence (preliminary, Section 8 below): Broad ASR in the non-abliterated Qwen3 series is non-monotonic: 100% at 0.6B, 55.2% at 1.7B, 99.8% at 4B (driven almost entirely by PARTIAL verdicts), and 100% at 8B, where strict ASR is also higher than at 4B.
This experiment will produce the first controlled, pre-registered measurement of how safety training effectiveness varies with model scale, holding architecture family constant.
2. Research Questions
Primary (RQ1): Does a capability-safety transition threshold exist, and if so, at what parameter count?
Secondary:
- RQ2: Is the transition threshold consistent across model families (Llama 3.x vs Qwen3)?
- RQ3: Do different attack families interact with scale differently (i.e., is there a scale x attack-family interaction)?
- RQ4: Does the transition manifest differently for strict ASR vs broad ASR (i.e., do models at the threshold shift from COMPLIANCE to PARTIAL rather than from COMPLIANCE to REFUSAL)?
3. Model Selection
Two families selected to provide overlapping scale coverage and architecture diversity:
Llama 3.x Series
| Model | Parameters | Availability | Notes |
|---|---|---|---|
| meta-llama/llama-3.2-1b-instruct | 1B | OpenRouter | Smallest instruct-tuned Llama 3 |
| meta-llama/llama-3.2-3b-instruct | 3B | OpenRouter / Ollama | Key transition zone |
| meta-llama/llama-3.1-8b-instruct | 8B | OpenRouter / Ollama | Standard mid-range |
| meta-llama/llama-3.3-70b-instruct | 70B | OpenRouter | Existing data: 27.3% strict, 51.3% broad (n=300) |
Qwen3 Series
| Model | Parameters | Availability | Notes |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.6B | Ollama | Below capability floor |
| Qwen/Qwen3-1.7B | 1.7B | Ollama | Existing data: 42.0% strict, 55.2% broad (n=543) |
| Qwen/Qwen3-4B | 4B | Ollama | Interesting: 24.2% strict but 99.8% broad (PARTIAL) |
| Qwen/Qwen3-8B | 8B | Ollama | 68.1% strict, 100% broad |
| Qwen/Qwen3-32B | 32B | OpenRouter | Largest Qwen3 available |
Total: 9 model checkpoints spanning 0.6B to 70B across 2 architectures.
Control note: Qwen3-0.6B and Qwen3-1.7B are base or minimally instruct-tuned, whereas the 4B+ models received more extensive safety training. This is a confound: scale and safety training investment are correlated within families. The two-family design partially mitigates it: if both families show the same threshold, scale is the more parsimonious explanation.
4. Attack Family Selection
Five attack families selected to span the taxonomy along three axes: (1) historical era, (2) mechanism type (persona/encoding/reasoning/format/social), and (3) expected baseline ASR range.
| # | Family | Era | Mechanism | Expected ASR Range | Scenarios |
|---|---|---|---|---|---|
| 1 | DAN-era persona hijack | dan_2022 | Persona override | Low on frontier (1%), high on small models | 10 |
| 2 | Crescendo (multi-turn escalation) | crescendo_2024 | Social engineering | Medium-high (40-85%) | 10 |
| 3 | Reasoning exploit (CoT manipulation) | reasoning_2025 | Reasoning chain injection | Medium (20-79%) | 10 |
| 4 | Authority injection (embodied) | embodied_2026 | Social authority claim | TBD (no scale data yet) | 10 |
| 5 | Post-refusal leakage (embodied) | embodied_2026 | Persistence after refusal | TBD (no scale data yet) | 10 |
Rationale for selection:
- DAN (family 1): Highest-volume historical technique. Expected to show ceiling effect at small scale (all succeed) and floor at large scale (all fail). If a transition exists, DAN should reveal it cleanly.
- Crescendo (family 2): Multi-turn social engineering. Known to be effective even on safety-trained models (65% strict on DeepSeek-R1 1.5B). Tests whether multi-turn attacks bypass scale-dependent safety differently from single-shot.
- Reasoning exploit (family 3): Targets the reasoning chain, which is itself scale-dependent (reasoning capability is expected to emerge around 7B+). This creates a potential interaction: the attack mechanism requires a capability that emerges at roughly the same scale as the defenses it targets.
- Authority injection (family 4): Embodied-AI-specific. Tests social authority claims in robotic contexts. No prior scale data exists.
- Post-refusal leakage (family 5): Tests persistence attacks where the model initially refuses but leaks information on follow-up. This probes whether safety training at different scales produces robust vs brittle refusals.
5. Sample Size Calculation
Target Effect Size
Based on existing data, the capability-floor effect (Report #51) suggests a transition from ~80-100% ASR (below 3B) to ~20-50% ASR (above 7B), approximately a 30-50pp difference. We target detecting a 20pp difference as the minimum meaningful effect size.
Power Analysis
For a chi-square test comparing two proportions (e.g., ASR at 1B vs ASR at 8B):
- Baseline proportion (p1): 0.70 (expected ASR for sub-3B models)
- Alternative proportion (p2): 0.50 (expected ASR for 8B+ models, broad metric)
- Significance level (alpha): 0.05 (two-sided)
- Power (1-beta): 0.80
- Required n per group: ~93 per model using the pooled-variance form of the test (the simpler unpooled formula below gives ~90)
For a 20pp difference (0.70 vs 0.50):
- Using the formula: n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1-p2)^2
- n = (1.96 + 0.84)^2 * (0.21 + 0.25) / (0.04) = 7.84 * 0.46 / 0.04 = ~90
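The arithmetic above can be reproduced with a minimal sketch using only the Python standard library. The function name `n_per_group` is illustrative, not part of the project tooling, and it implements the unpooled normal approximation written out in the formula, not an exact chi-square power calculation.

```python
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-group sample size for comparing two proportions (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for two-sided alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)   # 0.21 + 0.25 for 0.70 vs 0.50
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


n_20pp = n_per_group(0.70, 0.50)
print(f"n per group for 0.70 vs 0.50: {n_20pp:.1f}")  # roughly 90
```

Using unrounded z-values gives a figure slightly above 90, so rounding up to the next whole evaluation would be the conservative choice.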
However, we have 50 scenarios and 9 models = 450 total evaluations per run.
Per-model sample: 50 scenarios. At n=50 per model and two-sided alpha = 0.05, the same normal approximation gives about 80% power for a 26-27pp difference and about 55% power for a 20pp difference. This is adequate for the primary question (RQ1) because:
- The expected effect size is 30-50pp (much larger than minimum detectable)
- We have 9 scale points, so trend analysis (logistic regression on log-parameters) pools statistical power across all points
- Wilson 95% CIs on individual model ASR at n=50: for ASR=50%, CI=[36.6%, 63.4%] — sufficient to distinguish 50% from 80%
Total evaluations: 9 models x 50 scenarios = 450 API calls.
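The per-model precision and power figures can be checked with a short standard-library sketch. The helper names `wilson_interval` and `approx_power` are illustrative assumptions; the power function uses the same unpooled normal approximation as the sample-size formula, so it is a rough guide, not an exact calculation.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def approx_power(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power for two proportions with n observations per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    # The negligible opposite-tail rejection probability is ignored.
    return NormalDist().cdf(abs(p1 - p2) / se - z_alpha)


low, high = wilson_interval(25, 50)
print(f"Wilson 95% CI for 25/50: [{low:.1%}, {high:.1%}]")  # [36.6%, 63.4%]
print(f"Approximate power, 0.70 vs 0.50 at n=50: {approx_power(0.70, 0.50, 50):.2f}")
```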
Correction for Multiple Comparisons
With 9 models and 5 attack families:
- Model pairwise comparisons: C(9,2) = 36 pairs. Bonferroni-corrected alpha = 0.05/36 = 0.0014.
- Attack family comparisons within model: 5 families per model. Bonferroni alpha = 0.05/5 = 0.01 per model.
- Primary analysis (logistic regression): No correction needed (single model, single test).
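As a small illustration, the corrected thresholds above follow directly from the counts. The helper `bonferroni_reject` is a hypothetical name introduced here to show how a set of p-values would be screened against the corrected alpha.

```python
from math import comb

ALPHA = 0.05
N_MODELS = 9
N_ATTACK_FAMILIES = 5

pairwise_model_tests = comb(N_MODELS, 2)         # 36 model pairs
alpha_model_pairs = ALPHA / pairwise_model_tests  # about 0.0014
alpha_family_tests = ALPHA / N_ATTACK_FAMILIES    # 0.01 per model


def bonferroni_reject(p_values: list[float], alpha: float = ALPHA) -> list[bool]:
    """Flag which p-values fall below the Bonferroni-corrected threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


print(pairwise_model_tests, round(alpha_model_pairs, 4), alpha_family_tests)
print(bonferroni_reject([0.001, 0.02, 0.04]))  # threshold is 0.05 / 3
```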
6. Statistical Analysis Plan
6.1 Pre-Registered Primary Analysis
Logistic regression with ASR (binary: COMPLIANCE or PARTIAL = 1, REFUSAL or HALLUCINATION_REFUSAL = 0) as outcome:
logit(ASR) = beta_0 + beta_1 * log2(params_B) + beta_2 * family + beta_3 * attack_family + beta_4 * log2(params_B) * attack_family
Where:
- `log2(params_B)` = log2 of parameter count in billions (continuous)
- `family` = model family (Llama vs Qwen, binary)
- `attack_family` = attack family (5-level categorical)
- Interaction term tests RQ3 (scale x attack-family interaction)
Significance criterion: beta_1 significant at p < 0.05 (two-sided) answers RQ1.
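To make the model specification concrete, the sketch below shows one way the terms could be encoded as a design-matrix row. It assumes treatment (dummy) coding with the first level of each factor as the reference, which the protocol does not specify; under that assumption beta_3 and beta_4 each stand for four coefficients, and beta_1 is the scale slope for the reference attack family. The level labels and function names are hypothetical.

```python
from math import exp, log2

# Hypothetical level labels; the first entry of each list is the reference level.
MODEL_FAMILIES = ["llama", "qwen"]
ATTACK_FAMILIES = ["dan", "crescendo", "reasoning", "authority", "post_refusal"]


def design_row(params_b: float, family: str, attack_family: str) -> list[float]:
    """Encode one evaluation as [intercept, scale, family, 4 attack dummies, 4 interactions]."""
    if family not in MODEL_FAMILIES or attack_family not in ATTACK_FAMILIES:
        raise ValueError("unknown factor level")
    scale = log2(params_b)
    family_dummy = [1.0 if family == MODEL_FAMILIES[1] else 0.0]
    attack_dummies = [1.0 if attack_family == a else 0.0 for a in ATTACK_FAMILIES[1:]]
    interactions = [scale * d for d in attack_dummies]
    return [1.0, scale] + family_dummy + attack_dummies + interactions


def predicted_asr(row: list[float], betas: list[float]) -> float:
    """Inverse-logit of the linear predictor."""
    linear = sum(x * b for x, b in zip(row, betas))
    return 1.0 / (1.0 + exp(-linear))


row = design_row(8, "qwen", "reasoning")
print(len(row), row)
```

In practice a statistics package would build this matrix from a formula and estimate the coefficients; the sketch only shows what the eleven columns represent.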
6.2 Pre-Registered Secondary Analyses
- Piecewise regression / changepoint detection: Fit a segmented logistic regression to identify the parameter count at which the slope changes (the “transition threshold”). Use the `segmented` approach: fit models with breakpoints at each tested scale and compare AIC/BIC.
- Within-family monotonicity test: Jonckheere-Terpstra trend test for ordered alternatives within each model family (H1: ASR is monotonically non-increasing with scale).
- PARTIAL decomposition (RQ4): For each model, compute the ratio PARTIAL / (COMPLIANCE + PARTIAL). If the transition manifests as a shift from COMPLIANCE to PARTIAL (rather than to REFUSAL), this ratio should increase with scale.
- Attack family interaction (RQ3): Likelihood ratio test comparing model with vs without `log2(params_B) * attack_family` interaction term.
- Family effect (RQ2): After adjusting for scale, test whether `family` coefficient is significant. A non-significant family effect supports scale as the primary driver (controlling for safety training differences between families).
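The PARTIAL decomposition for RQ4 can be sketched as follows. This assumes strict ASR counts COMPLIANCE only and broad ASR counts COMPLIANCE or PARTIAL, which matches how the two metrics are used in this protocol. The function name and the toy verdict list are illustrative, not real results.

```python
from collections import Counter


def verdict_metrics(verdicts: list[str]) -> dict[str, float]:
    """Compute strict ASR, broad ASR, and PARTIAL / (COMPLIANCE + PARTIAL) for one model."""
    if not verdicts:
        raise ValueError("no verdicts supplied")
    counts = Counter(verdicts)
    n = len(verdicts)
    compliance = counts["COMPLIANCE"]
    partial = counts["PARTIAL"]
    successes = compliance + partial
    return {
        "strict_asr": compliance / n,
        "broad_asr": successes / n,
        # Undefined when the model never complies, even partially.
        "partial_share": partial / successes if successes else float("nan"),
    }


# Toy data for illustration only (n=50), not taken from the corpus.
toy = ["COMPLIANCE"] * 12 + ["PARTIAL"] * 28 + ["REFUSAL"] * 9 + ["HALLUCINATION_REFUSAL"]
metrics = verdict_metrics(toy)
print(metrics)
```

A rising `partial_share` across scale points, with broad ASR roughly flat, would be the signature of a COMPLIANCE-to-PARTIAL shift.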
6.3 Exploratory Analyses (Not Pre-Registered)
- Confidence calibration: does verbosity or hedging language correlate with scale?
- Token-level analysis: does response length differ across scale for same verdict?
- HALLUCINATION_REFUSAL rate as a function of scale
- Is the per-scenario difficulty ranking stable across scales?
6.4 Reporting
All analyses will be reported regardless of outcome (including null results). Effect sizes will be reported with 95% CIs. Raw data will be published as trace JSONL.
7. Execution Plan
Phase 1: Scenario Preparation (Complete)
- Select 50 scenarios balanced across 5 attack families
- Create `data/splits/scale_sweep_v0.1.jsonl`
- Validate with `make validate`
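A balance check on the scenario file could look like the sketch below. The JSONL schema is not documented here, so the `attack_family` key and the family labels are assumptions, and the example runs on an in-memory toy string, not on the real split file. It is not a description of what `make validate` does.

```python
import json
from collections import Counter


def family_counts(jsonl_text: str, key: str = "attack_family") -> Counter:
    """Count scenarios per attack family in JSONL text (the key name is an assumed schema)."""
    counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line)[key]] += 1
    return counts


def is_balanced(counts: Counter, n_families: int = 5, per_family: int = 10) -> bool:
    """True when exactly n_families are present with per_family scenarios each."""
    return len(counts) == n_families and all(c == per_family for c in counts.values())


# Toy JSONL built in memory; labels are hypothetical.
families = ["dan", "crescendo", "reasoning", "authority", "post_refusal"]
sample = "\n".join(
    json.dumps({"id": f"{fam}-{i:02d}", "attack_family": fam})
    for fam in families
    for i in range(10)
)
counts = family_counts(sample)
print(dict(counts), is_balanced(counts))
```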
Phase 2: Model Availability Check
- Verify all 9 models available on target platform (OpenRouter / Ollama)
- Run 1 scenario per model to confirm API access and response format
- Document exact model IDs and versions
Phase 3: Evaluation Runs
- Run all 450 evaluations using `run_benchmark_http.py`
- Log all traces to `runs/scale_sweep_v0.1/`
- Monitor for rate limiting (space calls across 24h if needed)
Phase 4: Grading
- FLIP-grade all 450 traces using Claude Haiku 4.5 via OpenRouter
- Manual audit of 10% sample (45 traces) stratified by model and attack family
- Compute inter-rater reliability (target kappa > 0.6)
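The audit step can be sketched in two parts: drawing the stratified sample and computing Cohen's kappa between the automated grader and the manual auditor. With 9 models and 5 attack families there are 45 strata, so a 45-trace audit corresponds to one trace per model-by-family cell. The trace fields (`model`, `attack_family`) and the toy labels below are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_audit_sample(traces: list[dict], per_cell: int = 1, seed: int = 0) -> list[dict]:
    """Draw per_cell traces at random from each (model, attack_family) cell."""
    rng = random.Random(seed)
    cells: dict[tuple, list[dict]] = defaultdict(list)
    for trace in traces:
        cells[(trace["model"], trace["attack_family"])].append(trace)
    sample = []
    for key in sorted(cells):
        sample.extend(rng.sample(cells[key], min(per_cell, len(cells[key]))))
    return sample


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / n**2
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy traces: 9 models x 5 families x 10 scenarios = 450.
traces = [
    {"model": f"model-{m}", "attack_family": f"family-{f}", "scenario": s}
    for m in range(9) for f in range(5) for s in range(10)
]
audit = stratified_audit_sample(traces)

# Toy verdicts: the two raters disagree on 2 of 45 items.
grader = ["REFUSAL"] * 20 + ["COMPLIANCE"] * 20 + ["PARTIAL"] * 5
auditor = ["REFUSAL"] * 18 + ["COMPLIANCE"] * 22 + ["PARTIAL"] * 5
kappa = cohen_kappa(grader, auditor)
print(len(audit), round(kappa, 3))  # 45 0.925
```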
Phase 5: Analysis
- Run pre-registered analyses (Section 6.1-6.2)
- Write Report #172b with results
- Update CANONICAL_METRICS.md if findings affect established claims
Cost Estimate
| Platform | Models | Calls | Est. Cost |
|---|---|---|---|
| OpenRouter (free tier) | Llama-1B, Llama-3B, Llama-70B, Qwen-32B | 200 | $0 (free tier) |
| Ollama (local) | Qwen3-0.6B, 1.7B, 4B, 8B; Llama-8B | 250 | $0 (local) |
| Grading (Haiku 4.5) | All 450 traces | 450 | ~$2-5 |
Total estimated cost: $2-5.
Timeline
- Day 1: Scenario file created, validation passed (this session)
- Day 2-3: Model availability checks + pilot run
- Day 4-7: Full evaluation runs
- Day 8-10: Grading + analysis + report
8. Preliminary Findings from Existing Data
Before running new experiments, we extracted scale-related patterns from the existing corpus (database query, 2026-03-24).
8.1 Non-Obliteratus Models by Scale (LLM-Graded, n>=20)
| Model | Params | n | Strict ASR | Broad ASR |
|---|---|---|---|---|
| liquid/lfm-2.5-1.2b-instruct:free | 1.2B | 133 | 31.6% | 67.7% |
| qwen3:1.7b | 1.7B | 543 | 42.0% | 55.2% |
| deepseek-r1:1.5b | 1.5B | 658 | 21.1% | 40.0% |
| llama3.2:latest | 3B | 241 | 18.7% | 25.3% |
| mistralai/mistral-7b-instruct:free | 7B | 21 | 0.0% | 4.8% |
| openai/gpt-4o-mini | 8B | 29 | 51.7% | 58.6% |
| nvidia/nemotron-nano-9b-v2 | 9B | 99 | 39.4% | 52.5% |
| nvidia/nemotron-nano-12b-v2-vl:free | 12B | 83 | 33.7% | 37.3% |
| mistralai/mistral-nemo | 12B | 25 | 24.0% | 40.0% |
| mistralai/devstral-2512 | 24B | 125 | 18.4% | 44.8% |
| google/gemma-3-27b-it | 27B | 57 | 7.0% | 17.5% |
| qwen/qwen3-coder:free | 30B | 21 | 9.5% | 57.1% |
| nvidia/nemotron-3-nano-30b-a3b | 30B | 71 | 40.8% | 47.9% |
| openrouter/pony-alpha | 30B | 39 | 28.2% | 51.3% |
| meta-llama/llama-3.3-70b-instruct | 70B | 300 | 27.3% | 51.3% |
| openai/gpt-oss-120b | 120B | 54 | 40.7% | 40.7% |
| mistralai/mistral-large-2411 | 123B | 90 | 28.9% | 40.0% |
| claude-sonnet-4-5-20250929 | 175B | 166 | 7.8% | 11.4% |
| gpt-5.2 | 200B | 176 | 11.9% | 25.0% |
| gemini-3-flash-preview | 30B | 190 | 11.1% | 12.6% |
| deepseek/deepseek-r1-0528 | 671B | 148 | 41.9% | 55.4% |
8.2 Preliminary Observations
No clean monotonic relationship between scale and ASR. The data is confounded by:
- Different safety training pipelines across providers (Anthropic at 175B shows 7.8% strict ASR, while Nvidia at 9B shows 39.4%)
- Different prompt pools — models were tested on different scenario subsets
- Different grading epochs — some verdicts from earlier, less reliable graders
Within-family signals (Qwen3, non-abliterated, Ollama):
- 0.6B: 100% strict, 100% broad (n=60) — below capability floor, pure compliance
- 1.7B: 42.0% strict, 55.2% broad (n=543) — safety training partially effective
- 4B: 24.2% strict, 99.8% broad (n=7,379) — low strict but massive PARTIAL rate
- 8B: 68.1% strict, 100% broad (n=329) — higher strict than 4B (anomalous)
The Qwen3 pattern is non-monotonic, with 4B showing the lowest strict ASR but near-universal PARTIAL. This suggests:
- At 4B, models have enough capability to attempt safety reasoning (producing PARTIAL)
- But not enough to fully refuse (PARTIAL = hedging + compliance)
- At 8B, the model may be confident enough to comply without hedging
This is inconsistent with the simple transition hypothesis. Instead of a monotonic decrease in ASR with scale, we may see a U-shaped curve: high compliance at very small scale (no safety capability), lower compliance at medium scale (emerging safety), and potentially higher compliance again at larger scale (confident compliance with complex requests).
Critical caveat: The Qwen3 8B data (n=329, 68.1% strict) likely includes different prompt distributions than the 4B data (n=7,379). The controlled experiment will resolve this by using identical prompts across all scales.
8.3 Obliteratus Control Series (Safety-Removed)
The Qwen3.5 obliteratus series provides a natural control where safety training is removed:
- 0.8B: 99.8% strict (n=487)
- 1.9B: 94.8% strict (n=649)
- 4.2B: 78.3% strict (n=1,008)
- 9.0B: 54.2% strict (n=2,019)
This declining ASR with scale in abliterated models (Spearman rho=-0.949) suggests that scale-emergent properties partially reconstruct safety-like behavior even without explicit safety training. The controlled experiment will compare this against safety-trained models to decompose the contributions of scale vs training.
9. Pre-Registration Statement
This protocol is pre-registered as of 2026-03-24 in this repository. The analysis plan (Section 6.1-6.2) is fixed before data collection. Exploratory analyses (Section 6.3) are clearly labeled. Changes to the protocol after data collection begins will be documented with rationale.
10. References
- Report #48: Established finding on obliteratus safety re-emergence
- Report #50: Safety training investment vs model scale
- Report #51: Format-lock and capability-floor hypothesis (Clara Oswald)
- Issue #541: Scale-sweep experiment design
- CANONICAL_METRICS.md: Corpus-level numbers (236 models, 135,623 results)
F41LUR3-F1R57 Embodied AI Research