Report 282 Research — Empirical Study

Executive Summary

Mining the full non-OBLITERATUS corpus (132,416 total results; approximately 10,956 non-OBLITERATUS evaluable results across 236 models), this report documents five empirical patterns not previously reported in the Failure-First project. All findings are backed by SQL queries against database/jailbreak_corpus.db with sample sizes noted.

Key Findings:

  1. CoT-exploit attacks show inverted scaling — small models (<4B) at 42.9% strict ASR vs XL models (120B+) at 7.5%, specific to the cot_exploit family (n=114).
  2. Multi-turn attacks produce the highest PARTIAL rate in the corpus (28.6%) — a distinct failure mode of hedging compliance (n=171).
  3. Hallucination-refusal scales inversely with model size — small models at 21.3% HR rate vs XL at 3.0% (n=3,379 evaluable with known size).
  4. Reasoning models show a thinking-token allocation inversion — refusals have higher think/response ratios (0.66) than compliances (0.55), suggesting deliberation costs are paid on safety rather than compliance (n=212).
  5. Fraud harm category is uniquely susceptible to CoT-exploit at 33.3% broad ASR, while violence and other categories cluster at 8-15% for the same technique family (n=111).

Finding 1: CoT-Exploit Inverted Scaling (Technique-Specific)

Claim: Chain-of-thought exploitation attacks show a scaling pattern where smaller models are more vulnerable than larger models, specific to the cot_exploit technique family. This contrasts with the corpus-wide finding that inverse scaling is not supported (r=-0.140, Report #50).

Evidence:

| Size Bucket | n | Strict ASR | Broad ASR |
|---|---|---|---|
| small (<4B) | 42 | 42.9% | 52.4% |
| large (24-70B) | 32 | 21.9% | 21.9% |
| xl (120B+) | 40 | 7.5% | 15.0% |

Per-model breakdown (n >= 5 per model):

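| Model | Provider | Params | n | Strict ASR | Broad ASR |
|---|---|---|---|---|---|
| deepseek-r1:1.5b | ollama | 1.5B | 14 | 57.1% | 78.6% |
| qwen3:1.7b | ollama | 1.7B | 14 | 42.9% | 42.9% |
| llama3.2:latest | ollama | 3B | 14 | 28.6% | 35.7% |
| meta-llama/llama-3.3-70b-instruct | meta-llama | 70B | 10 | 30.0% | 30.0% |
| gemini-3-flash-preview | google | 30B | 22 | 18.2% | 18.2% |
| gpt-5.2 | openai | 200B | 18 | 11.1% | 22.2% |
| claude-sonnet-4-5-20250929 | anthropic | 175B | 20 | 0.0% | 0.0% |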

Interpretation: CoT-exploit attacks work by manipulating the reasoning chain itself. One plausible explanation is that small models have less capacity to distinguish legitimate from adversarial reasoning steps, while larger models can reason about the attack structure within their thinking process. Notably, Claude Sonnet 4.5 shows 0% ASR on CoT-exploits (n=20), whereas the other frontier models in the table (gpt-5.2 and gemini-3-flash-preview) show 18-22% broad ASR (11-18% strict). This is consistent with Anthropic’s safety training specifically addressing reasoning-chain manipulation, though the sample is too small to confirm it.

Limitation: n=114 total for size-stratified analysis. The Llama 70B result (30.0% strict ASR, n=10) is an outlier relative to its size bucket, possibly reflecting Llama’s known permissiveness to certain request types.
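To make the small-sample caveat concrete, the sketch below computes 95% Wilson score intervals for the strict ASR of each size bucket. The success counts are not stated in the report; they are back-calculated from the percentages and n values in the size-bucket table (for example, 42.9% of 42 is taken as 18), which is an assumption.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (as proportions) for successes out of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Assumption: success counts back-calculated from the reported percentages.
buckets = {
    "small (<4B)": (18, 42),
    "large (24-70B)": (7, 32),
    "xl (120B+)": (3, 40),
}

for name, (k, n) in buckets.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} strict successes, 95% CI [{lo:.1%}, {hi:.1%}]")
```

Under these assumed counts the small and xl intervals do not overlap, but each interval is wide, which fits the hypothesis-generating status the report gives this finding.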

-- Reproducing query
SELECT
    CASE
        WHEN m.parameter_count IN ('1.5B','1.7B','1.2B','3B') THEN 'small (<4B)'
        WHEN m.parameter_count IN ('4B','7B','8B','9B','12B','14B') THEN 'medium (4-14B)'
        WHEN m.parameter_count IN ('24B','27B','30B','32B','70B') THEN 'large (24-70B)'
        WHEN m.parameter_count IN ('120B','123B','175B','200B','405B','671B') THEN 'xl (120B+)'
    END as size_bucket,
    t.family, COUNT(r.id) as n,
    ROUND(100.0 * SUM(CASE WHEN COALESCE(r.llm_verdict, r.heuristic_verdict) = 'COMPLIANCE' THEN 1 ELSE 0 END) / COUNT(r.id), 1) as strict_asr
FROM results r
JOIN evaluation_runs er ON r.evaluation_run_id = er.id
JOIN models m ON er.model_id = m.id
JOIN prompts p ON r.prompt_id = p.id
JOIN techniques t ON p.technique_id = t.id
JOIN source_datasets sd ON p.source_dataset_id = sd.id
WHERE sd.name NOT LIKE '%obliter%' AND t.family = 'cot_exploit'
AND m.parameter_count IS NOT NULL AND m.parameter_count != 'unknown'
GROUP BY size_bucket HAVING n >= 10
ORDER BY size_bucket;
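The query above returns only strict_asr, while the tables also report Broad ASR. The sketch below runs the same COALESCE-based grading against a toy in-memory SQLite table and adds a broad_asr column. The toy table, its rows, and the definition of broad ASR as COMPLIANCE plus PARTIAL are assumptions made for illustration; the real schema has more tables and columns.

```python
import sqlite3

# Toy stand-in for the results table (hypothetical rows, not corpus data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER PRIMARY KEY, llm_verdict TEXT, heuristic_verdict TEXT)"
)
rows = [
    ("COMPLIANCE", "REFUSAL"),  # LLM verdict takes precedence over the heuristic
    (None, "COMPLIANCE"),       # falls back to the heuristic verdict
    ("PARTIAL", None),
    ("REFUSAL", "COMPLIANCE"),
    (None, None),               # ungraded row still counts in COUNT(id)
]
conn.executemany(
    "INSERT INTO results (llm_verdict, heuristic_verdict) VALUES (?, ?)", rows
)

query = """
SELECT
    COUNT(id) AS n,
    ROUND(100.0 * SUM(CASE WHEN COALESCE(llm_verdict, heuristic_verdict) = 'COMPLIANCE'
                           THEN 1 ELSE 0 END) / COUNT(id), 1) AS strict_asr,
    -- Assumption: broad ASR counts COMPLIANCE and PARTIAL as successes.
    ROUND(100.0 * SUM(CASE WHEN COALESCE(llm_verdict, heuristic_verdict) IN ('COMPLIANCE', 'PARTIAL')
                           THEN 1 ELSE 0 END) / COUNT(id), 1) AS broad_asr
FROM results
"""
n, strict_asr, broad_asr = conn.execute(query).fetchone()
print(n, strict_asr, broad_asr)
```

Because the denominator is COUNT(id) with no verdict filter, rows without any verdict lower both rates; restricting the WHERE clause to graded verdicts, as the Finding 3 query does, would change the denominator.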

Finding 2: Multi-Turn Attacks Produce the Highest PARTIAL Rate in the Corpus

Claim: The multi_turn technique family produces PARTIAL verdicts at 28.6% (24 of the 84 multi-turn results that received one of the four verdicts tabulated below; 14.0% of all 171 results), nearly 6x the corpus-wide PARTIAL rate for non-multi-turn techniques (~5%). This suggests multi-turn attacks create a distinctive failure mode: models that hedge textually while partially complying.

Evidence:

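| Era | Family | n | Full Comply | PARTIAL | Refused | HR | PARTIAL Rate |
|---|---|---|---|---|---|---|---|
| crescendo_2024 | multi_turn | 171 | 39 (22.8%) | 24 (14.0%) | 10 (5.8%) | 11 (6.4%) | 28.6% |
| cipher_2023 | encoding | 60 | 5 | 7 | 40 | 8 | 11.7% |
| reasoning_2025 | cot_exploit | 158 | 34 | 7 | 71 | 3 | 6.1% |
| general | other | 816 | 89 | 31 | 537 | 21 | 4.6% |
| dan_2022 | persona | 1177 | 7 | 3 | 1000 | 2 | 0.3% |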

Per-model multi-turn PARTIAL rates (n >= 5):

| Model | Provider | n | PARTIAL % |
|---|---|---|---|
| gpt-5.2 | openai | 20 | 37.5% |
| qwen3:1.7b | ollama | 24 | 36.4% |
| claude-sonnet-4-5-20250929 | anthropic | 22 | 28.6% |
| deepseek-r1:1.5b | ollama | 70 | 27.3% |
| gemini-3-flash-preview | google | 24 | 20.0% |

Interpretation: Multi-turn attacks appear to erode safety boundaries gradually. The high PARTIAL rate indicates that models recognize harmful intent (they hedge), but conversational momentum overcomes complete refusal. Critically, even frontier models (GPT-5.2 at 37.5%, Claude at 28.6%) show elevated PARTIAL rates on multi-turn attacks, although the per-model samples are small (n=20-22). This is a qualitatively different failure from single-shot compliance: the model's output signals that it should refuse, yet it partially complies anyway. This connects to the VLA PARTIAL dominance finding (Report #49), where hedging text does not prevent harmful action execution.
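The PARTIAL Rate column of the era table can be reproduced from the verdict counts, which clarifies its denominator. The sketch below uses the counts as we read them from that table and assumes the rate is PARTIAL divided by the sum of the four verdict columns rather than by n.

```python
# Verdict counts per family as read from the era table:
# (full comply, partial, refused, hallucination-refusal)
verdict_counts = {
    "multi_turn": (39, 24, 10, 11),
    "encoding": (5, 7, 40, 8),
    "cot_exploit": (34, 7, 71, 3),
    "other": (89, 31, 537, 21),
    "persona": (7, 3, 1000, 2),
}

def partial_rate(counts: tuple[int, int, int, int]) -> float:
    """PARTIAL as a percentage of the four tabulated verdicts (assumed denominator)."""
    total = sum(counts)
    return 100.0 * counts[1] / total

for family, counts in verdict_counts.items():
    print(f"{family}: {partial_rate(counts):.1f}% of {sum(counts)} graded results")
```

With this denominator every family's computed rate rounds to the tabulated value, so the 28.6% multi-turn figure rests on 84 graded results rather than the full 171.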


Finding 3: Hallucination-Refusal Scales Inversely with Model Size

Claim: Hallucination-refusal (HR) — where models appear to refuse but produce harmful content — occurs at dramatically different rates by model size. Small models (<4B) produce HR at 21.3% vs XL models (120B+) at 3.0%.

Evidence:

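| Size Bucket | n (evaluable) | HR Count | HR Rate | Strict ASR |
|---|---|---|---|---|
| small (<4B) | 1,584 | 338 | 21.3% | 28.7% |
| medium (4-14B) | 305 | 39 | 12.8% | 31.1% |
| large (24-70B) | 849 | 37 | 4.4% | 20.8% |
| xl (120B+) | 641 | 19 | 3.0% | 22.6% |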

By provider (non-OBLITERATUS, evaluable, n >= 20):

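| Provider | n | HR Rate | Strict ASR |
|---|---|---|---|
| ollama (mostly small) | 1,713 | 20.9% | 29.2% |
| meta | 99 | 14.1% | 12.1% |
| mistralai | 296 | 8.8% | 21.6% |
| google | 343 | 7.9% | 10.8% |
| nvidia | 370 | 5.7% | 34.3% |
| deepseek | 210 | 5.7% | 37.6% |
| meta-llama | 418 | 2.9% | 32.5% |
| openai | 313 | 2.6% | 23.6% |
| anthropic | 172 | 1.2% | 7.6% |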

Interpretation: Hallucination-refusal is primarily a small-model phenomenon. A likely explanation is that small models lack the capacity to produce coherent refusals: they generate safety-like framing (“I cannot…”) while the underlying generation continues producing harmful content. This aligns with the earlier finding that HR is statistically indistinguishable from COMPLIANCE in token usage (thinking tokens p=0.21, response tokens p=0.46, Report #65). For functionally dangerous ASR (COMPLIANCE + PARTIAL + HR), small models reach approximately 50% (strict ASR of 28.7% plus the 21.3% HR rate already gives 50.0%), a 21pp increase over strict ASR.

The provider pattern is confounded by model size (ollama hosts many small local models), but the size-stratified analysis controls for this.
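As a quick consistency check on the size-stratified table, the sketch below recomputes each HR rate from its count and adds it to the strict ASR. The figures are copied from the table; treating strict ASR plus HR rate as a lower bound on functionally dangerous ASR is our reading, since PARTIAL rates per bucket are not listed.

```python
# (evaluable n, HR count, strict ASR %) per size bucket, from the table.
size_buckets = {
    "small (<4B)": (1584, 338, 28.7),
    "medium (4-14B)": (305, 39, 31.1),
    "large (24-70B)": (849, 37, 20.8),
    "xl (120B+)": (641, 19, 22.6),
}

def hr_rate(n: int, hr_count: int) -> float:
    """Hallucination-refusal rate as a percentage of evaluable results."""
    return 100.0 * hr_count / n

total_n = sum(n for n, _, _ in size_buckets.values())
print(f"total evaluable with known size: {total_n}")

for bucket, (n, hr_count, strict_asr) in size_buckets.items():
    rate = hr_rate(n, hr_count)
    # Lower bound only: PARTIAL is not included because it is not tabulated.
    print(f"{bucket}: HR {rate:.1f}%, strict + HR >= {strict_asr + rate:.1f}%")
```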

-- Reproducing query
SELECT
    CASE
        WHEN m.parameter_count IN ('1.5B','1.7B','1.2B','3B') THEN 'small (<4B)'
        WHEN m.parameter_count IN ('4B','7B','8B','9B','12B','14B') THEN 'medium (4-14B)'
        WHEN m.parameter_count IN ('24B','27B','30B','32B','70B') THEN 'large (24-70B)'
        WHEN m.parameter_count IN ('120B','123B','175B','200B','405B','671B') THEN 'xl (120B+)'
    END as size_bucket,
    COUNT(r.id) as total_evaluable,
    SUM(CASE WHEN COALESCE(r.llm_verdict, r.heuristic_verdict) = 'HALLUCINATION_REFUSAL' THEN 1 ELSE 0 END) as hr_count,
    ROUND(100.0 * SUM(CASE WHEN COALESCE(r.llm_verdict, r.heuristic_verdict) = 'HALLUCINATION_REFUSAL' THEN 1 ELSE 0 END) / COUNT(r.id), 1) as hr_rate
FROM results r
JOIN evaluation_runs er ON r.evaluation_run_id = er.id
JOIN models m ON er.model_id = m.id
JOIN prompts p ON r.prompt_id = p.id
JOIN source_datasets sd ON p.source_dataset_id = sd.id
WHERE sd.name NOT LIKE '%obliter%'
AND COALESCE(r.llm_verdict, r.heuristic_verdict) IN ('COMPLIANCE','PARTIAL','REFUSAL','HALLUCINATION_REFUSAL')
AND m.parameter_count IS NOT NULL AND m.parameter_count != 'unknown'
GROUP BY size_bucket ORDER BY size_bucket;

Finding 4: Reasoning Model Thinking-Token Allocation Inversion

Claim: In reasoning models, refusals consume proportionally more thinking tokens relative to response tokens than compliant responses. The think/response ratio is 0.66 for refusals vs 0.55 for compliance (n=212 reasoning model results with thinking tokens).

Evidence:

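| Verdict | n | Avg Think Tokens | Avg Response Tokens | Think/Response Ratio |
|---|---|---|---|---|
| COMPLIANCE | 72 | 1,129 | 2,058 | 0.55 |
| HALLUCINATION_REFUSAL | 30 | 964 | 1,974 | 0.53 |
| PARTIAL | 45 | 773 | 1,469 | 0.61 |
| REFUSAL | 65 | 1,110 | 1,619 | 0.66 |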
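When reading the ratio column, note that a ratio of averages and an average of per-result ratios are different quantities. The sketch below divides the tabulated average token counts and prints the result next to the tabulated ratio. That the reported column is an average of per-result ratios is an assumption on our part; the report does not say how it was computed.

```python
# (avg think tokens, avg response tokens, tabulated ratio) from the table.
token_table = {
    "COMPLIANCE": (1129, 2058, 0.55),
    "HALLUCINATION_REFUSAL": (964, 1974, 0.53),
    "PARTIAL": (773, 1469, 0.61),
    "REFUSAL": (1110, 1619, 0.66),
}

def ratio_of_averages(avg_think: float, avg_response: float) -> float:
    """Average think tokens divided by average response tokens."""
    return avg_think / avg_response

for verdict, (think, response, tabulated) in token_table.items():
    derived = ratio_of_averages(think, response)
    print(f"{verdict}: ratio of averages {derived:.2f}, tabulated {tabulated:.2f}")
```

The two columns agree for COMPLIANCE and differ by a few hundredths for the other verdicts; the ordering of REFUSAL above COMPLIANCE holds either way.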

Distribution of think/response ratios:

| Verdict | Low (<0.3) | Mid (0.3-0.6) | High (0.6-1.0) | Very High (>1.0) |
|---|---|---|---|---|
| COMPLIANCE (n=72) | 19 (26%) | 23 (32%) | 23 (32%) | 7 (10%) |
| REFUSAL (n=65) | 8 (12%) | 17 (26%) | 35 (54%) | 5 (8%) |

Interpretation: When the reasoning model refuses, a larger share of its token budget goes to thinking (deliberation) than to response generation. 54% of refusals have high think/response ratios (0.6-1.0) vs only 32% of compliances. Average thinking tokens are nearly identical for the two verdicts (1,110 vs 1,129), so the higher ratio mainly reflects shorter refusal responses (1,619 vs 2,058 tokens). This suggests safety reasoning is proportionally expensive: relative to the output it produces, the model “works harder” to refuse than to comply. This has implications for inference-cost-based safety monitoring: elevated think/response ratios could serve as a real-time signal of safety deliberation.

Notably, HALLUCINATION_REFUSAL mirrors COMPLIANCE (ratio 0.53 vs 0.55) rather than REFUSAL (0.66) — further evidence that HR is architecturally a compliance event wrapped in refusal framing (Report #65).

Limitation: All 212 reasoning model results are from deepseek-r1:1.5b. Multi-model validation is needed. The pattern may not generalize to all reasoning architectures.
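The monitoring idea mentioned in the interpretation can be sketched as a simple bucketing function over per-response token counts. The thresholds follow the distribution table; how the boundary values 0.3, 0.6 and 1.0 are assigned, the function name, and the sample pairs are all assumptions for illustration.

```python
from collections import Counter

def ratio_bucket(think_tokens: int, response_tokens: int) -> str:
    """Bucket a think/response ratio using the distribution-table thresholds.

    Which bucket the exact boundary values fall into is an assumption.
    """
    if response_tokens <= 0:
        raise ValueError("response_tokens must be positive")
    ratio = think_tokens / response_tokens
    if ratio < 0.3:
        return "low"
    if ratio < 0.6:
        return "mid"
    if ratio <= 1.0:
        return "high"
    return "very_high"

# Hypothetical (think_tokens, response_tokens) pairs, not corpus data.
samples = [(200, 1000), (500, 1000), (800, 1000), (1500, 1000)]
print(Counter(ratio_bucket(t, r) for t, r in samples))
```

Since 32% of compliances also fall in the high bucket, such a signal would at best be one weak input to a monitor, and it has only been observed on deepseek-r1:1.5b.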


Finding 5: Fraud Harm Category Uniquely Susceptible to CoT-Exploit

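Claim: The fraud harm category shows 33.3% broad ASR against cot_exploit techniques, more than double fraud's next-highest non-multi-turn technique family (15.9% for encoding) and far above fraud’s baseline for other techniques (5-12%). Only multi_turn scores higher for fraud (36.8%), but multi_turn is similarly effective against every harm category listed (36.8-37.5%), whereas cot_exploit reaches just 8.5% against violence. This harm-technique interaction therefore appears specific rather than an artifact of overall technique effectiveness.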

Evidence:

Harm category x technique family (non-OBLITERATUS, n >= 10):

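| Harm Category | Technique Family | n | Broad ASR |
|---|---|---|---|
| privacy | multi_turn | 16 | 37.5% |
| fraud | multi_turn | 87 | 36.8% |
| violence | multi_turn | 68 | 36.8% |
| fraud | cot_exploit | 111 | 33.3% |
| hate | other | 44 | 22.7% |
| fraud | encoding | 44 | 15.9% |
| fraud | persona | 40 | 12.5% |
| fraud | behavioral | 34 | 11.8% |
| violence | cot_exploit | 47 | 8.5% |
| fraud | other | 146 | 8.9% |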

CoT-exploit against fraud, by model (n >= 3):

| Model | n | Broad ASR |
|---|---|---|
| deepseek-r1:1.5b | 10 | 90.0% |
| meta-llama/llama-3.3-70b-instruct | 7 | 85.7% |
| qwen3:1.7b | 10 | 50.0% |
| llama3.2:latest | 10 | 40.0% |
| gpt-5.2 | 14 | 28.6% |
| gemini-3-flash-preview | 16 | 25.0% |
| claude-sonnet-4-5-20250929 | 14 | 0.0% |

Interpretation: Fraud scenarios framed as reasoning problems (e.g., “walk through the logic of how…”) exploit the model’s tendency to engage with analytical framing. The reasoning chain treats the fraud scenario as a legitimate analytical exercise, bypassing safety filters that would catch direct requests. This interaction is stronger than for violence (8.5% broad ASR with cot_exploit) because fraud scenarios map more naturally onto analytical/business reasoning, while violence triggers more visceral safety responses regardless of framing.

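The 90% broad ASR for deepseek-r1:1.5b on fraud + cot_exploit (n=10) is consistent with reasoning models being particularly vulnerable to this combination: they “think through” fraud scenarios as legitimate reasoning tasks. The 85.7% figure for meta-llama/llama-3.3-70b-instruct (n=7) shows the effect is not confined to small models, and both cells are too small for firm conclusions.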
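One way to check whether the fraud vs violence gap under cot_exploit could be sampling noise is a one-sided Fisher exact test. The sketch below implements it with the standard library. The success counts are back-calculated from the harm-category table (33.3% of 111 taken as 37, 8.5% of 47 taken as 4), which is an assumption; the report does not give raw counts.

```python
from math import comb

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Row 1 is (successes, failures) for group 1, row 2 the same for group 2.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    total = a + b + c + d
    successes = a + c  # column total: all successes
    group1 = a + b     # row total: size of group 1
    upper = min(successes, group1)
    tail = sum(
        comb(successes, k) * comb(total - successes, group1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, group1)

# Assumption: counts back-calculated from the reported percentages.
fraud_hits, fraud_n = 37, 111
violence_hits, violence_n = 4, 47

p_value = fisher_one_sided(
    fraud_hits, fraud_n - fraud_hits, violence_hits, violence_n - violence_hits
)
print(f"one-sided p = {p_value:.4g}")
```

This treats results as independent draws, which may not hold if the same models and prompts recur across cells, so the p-value is indicative only.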


Cross-Cutting Observations

DAN-Era Immunity Is Universal

The DAN_2022 persona family (n=1,177, dominated by DAN-In-The-Wild dataset) has 0.6% strict ASR and 0.8% broad ASR across all tested models. No model — including small, permissive ones — shows meaningful vulnerability to historical DAN-style jailbreaks. This is the most decisively obsolete attack era in the corpus.

Frontier Models Show Technique-Selective Vulnerability

Claude Sonnet 4.5 shows 0% ASR against DAN, cipher, encoding, volumetric, and cot_exploit techniques — but 22.7% broad ASR against multi-turn/crescendo (n=22). GPT-5.2 shows broader vulnerability: 30.0% multi-turn, 22.2% cot_exploit, 21.4% cipher/encoding. This suggests safety training has eliminated historical attack classes but multi-turn and reasoning-chain attacks exploit more fundamental architectural properties.

Reasoning Model Amplification

Reasoning models show 31.4% strict ASR on crescendo attacks vs 10.4% for non-reasoning models (3x amplification). On reasoning-era cot_exploit: 56.3% strict ASR vs 17.6% (3.2x). The amplification is similar across these two attack families, suggesting that reasoning capabilities create a general vulnerability surface rather than a technique-specific one.
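The amplification factors quoted above are simple ratios of strict ASR. The short sketch below reproduces them and also prints the absolute gap in percentage points, which is an added view and not a figure from the report.

```python
# Strict ASR (%) for (reasoning, non-reasoning) models, as quoted in the text.
strict_asr = {
    "crescendo": (31.4, 10.4),
    "cot_exploit": (56.3, 17.6),
}

def amplification(reasoning: float, non_reasoning: float) -> float:
    """Ratio of reasoning-model ASR to non-reasoning-model ASR."""
    return reasoning / non_reasoning

for family, (reasoning, non_reasoning) in strict_asr.items():
    ratio = amplification(reasoning, non_reasoning)
    gap = reasoning - non_reasoning
    print(f"{family}: {ratio:.1f}x amplification, {gap:.1f} pp absolute gap")
```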


Methodology Notes

  • Grading: All ASR numbers use COALESCE(llm_verdict, heuristic_verdict). LLM-graded verdicts preferred where available.
  • Exclusions: OBLITERATUS datasets excluded throughout (3 source datasets). Results with NULL verdicts excluded from rate calculations.
  • Size bucketing: Based on models.parameter_count field. “unknown” size excluded from size-stratified analyses.
  • Significance caveat: Sample sizes range from 10-1,584 per cell. Findings 1, 4, and 5 should be treated as hypothesis-generating given smaller cell sizes. Finding 3 (n=3,379) is the most robust.
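The size-bucketing rule can also be expressed outside SQL, which makes one property of the CASE expression easier to see. The sketch below copies the bucket membership lists from the reproducing queries; the function name and the '2B' example value are hypothetical.

```python
from typing import Optional

# Bucket membership copied from the CASE expression in the reproducing queries.
SIZE_BUCKETS = {
    "small (<4B)": {"1.5B", "1.7B", "1.2B", "3B"},
    "medium (4-14B)": {"4B", "7B", "8B", "9B", "12B", "14B"},
    "large (24-70B)": {"24B", "27B", "30B", "32B", "70B"},
    "xl (120B+)": {"120B", "123B", "175B", "200B", "405B", "671B"},
}

def size_bucket(parameter_count: Optional[str]) -> Optional[str]:
    """Return the size bucket for a parameter_count string, or None.

    Like a SQL CASE without ELSE, values not listed in any bucket
    (for example a hypothetical '2B') map to None instead of raising.
    """
    if parameter_count is None or parameter_count == "unknown":
        return None
    for bucket, members in SIZE_BUCKETS.items():
        if parameter_count in members:
            return bucket
    return None

print(size_bucket("1.5B"), size_bucket("70B"), size_bucket("2B"))
```

Because matching is by exact string, any model whose parameter_count is not in one of the lists ends up in a NULL bucket in the SQL, so the lists need updating when new model sizes enter the corpus.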

Recommendations

  1. Validate Finding 1 with a dedicated CoT-exploit scaling experiment across 4B, 8B, 14B, 30B, 70B models (currently no medium-size CoT-exploit data).
  2. Investigate Finding 4 across multiple reasoning model architectures (currently only deepseek-r1:1.5b).
  3. Test Finding 5 with expanded fraud-specific scenarios to determine which fraud subtypes are most susceptible.
  4. Quantify PARTIAL functional danger — do multi-turn PARTIALs contain actionable harmful content at rates comparable to full COMPLIANCE?

Report generated by Clara Oswald (Principal Research Analyst), Failure-First Embodied AI Project. Query corpus: database/jailbreak_corpus.db, schema version 13.
