Corpus Pattern Mining II — Six Novel Empirical Findings | Research | Failure-First

Adrian Wedd

Report 276 Research — Empirical Study 2026-03-25

Executive Summary

Continuing the pattern mining program initiated in Report #184, this report documents six empirical patterns discovered in the non-OBLITERATUS corpus (approximately 12,791 non-OBLITERATUS results across 236 models). All findings are backed by SQL queries against database/jailbreak_corpus.db with sample sizes and COALESCE(llm_verdict, heuristic_verdict) methodology noted. Each finding is novel relative to the 273 prior reports.

Key Findings:

Free-tier safety degradation on matched prompts: Llama 3.3-70B free-tier complies on 45 prompts where the paid tier does not (n=203 matched pairs). DeepSeek R1 free-tier shows 13 free-only compliances vs 1 paid-only (n=79). For these models the effect is directional, with the free tier less safe; Nemotron Nano 9B and GPT-OSS-120B show the reverse pattern.
Reasoning overthink inverted-U in DeepSeek R1: Short thinking traces (<500 tokens) yield 36.0% strict ASR, medium (500-1000) yield 40.0%, but very long traces (2000+) drop to 28.2%. More deliberation does not monotonically increase safety (n=182).
Safety theater verbosity ratio: Devstral-2512 generates refusals 1.91x longer than its compliances (1,487 vs 777 tokens). Models with theater_ratio > 1.2 produce verbose safety justifications that consume more compute than the harmful content they prevent (four of the five models tabulated in Finding 3 exceed 1.2).
Reasoning-era attacks bypass providers immune to legacy attacks: Google shows 0.0% ASR on cipher_2023 but 18.2% on reasoning_2025 (n=22). Meta-Llama shows 0.0% on general AdvBench but 45.0% on reasoning_2025 (n=20). These providers appear to have patched old attack families without generalizing to new ones; Anthropic (0.0%, n=20) and OpenAI (11.1%, n=18) do not show the same jump.
Response duration as a safety predictor: Responses under 2 seconds show 5.1% strict ASR; responses over 60 seconds show 25.8% (n=6,800 with duration data). Faster responses correlate with lower ASR, which the report attributes mainly to response length (compliant outputs are longer) rather than to a causal effect.
Response-to-thinking ratio predicts compliance direction: When response tokens exceed 5x thinking tokens (ratio > 5.0), strict ASR reaches 29.8% with only 13.0% refusal rate. When ratio is below 1.0x, refusal rate is 50.0% (n=1,224 with both metrics).

Finding 1: Free-Tier Safety Degradation on Matched Prompts

Claim: Some models served via OpenRouter’s free tier (:free suffix) show lower safety than their paid counterparts, measured on the same prompt set. Because the comparison uses identical prompts evaluated by both tiers, the gap is not a prompt-sampling artifact. The direction is not uniform: two of the seven model pairs below show the reverse pattern.

Evidence (Matched-Prompt Analysis):

Model Base	Matched Prompts	Free-Only Comply	Paid-Only Comply	Both Comply	Both Refuse
meta-llama/llama-3.3-70b-instruct	203	45	12	9	13
nvidia/nemotron-nano-9b-v2	98	14	27	7	8
deepseek/deepseek-r1-0528	79	13	1	3	2
openai/gpt-oss-120b	58	9	21	7	2
mistralai/devstral-2512	46	6	0	0	2
google/gemma-3-27b-it	89	2	0	0	24
mistralai/mistral-small-3.1-24b-instruct	60	0	0	0	0

Directional Analysis:

Llama 3.3-70B: 45 prompts where the free tier complied and the paid tier did not, vs only 12 in the reverse direction. Ratio: 3.75:1, with the free tier less safe.
DeepSeek R1: 13:1 ratio (free less safe).
NVIDIA Nemotron-9B and OpenAI GPT-OSS-120B show the opposite pattern (paid more compliant), which may indicate different routing or quantization behavior.
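
One way to check whether these discordant splits could plausibly arise by chance is an exact sign test on the free-only vs paid-only counts. The report does not run this test; the sketch below is an added illustration, and the function name `sign_test_p` is hypothetical. It assumes each matched prompt is an independent pair and that, under the null hypothesis, a discordant pair is equally likely to fall in either direction.

```python
from math import comb

def sign_test_p(free_only: int, paid_only: int) -> float:
    """Two-sided exact binomial p-value for discordant pairs under H0: p = 0.5."""
    n = free_only + paid_only
    if n == 0:
        return 1.0
    k = min(free_only, paid_only)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * tail)

# Discordant counts copied from the matched-prompt table above.
pairs = {
    "meta-llama/llama-3.3-70b-instruct": (45, 12),
    "nvidia/nemotron-nano-9b-v2": (14, 27),
    "deepseek/deepseek-r1-0528": (13, 1),
    "openai/gpt-oss-120b": (9, 21),
}
for model, (free_only, paid_only) in pairs.items():
    ratio = free_only / paid_only
    p_value = sign_test_p(free_only, paid_only)
    print(f"{model}: free/paid = {ratio:.2f}, p = {p_value:.4g}")
```

The test only uses discordant pairs, so prompts where both tiers agree do not affect the result.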

Source datasets for Llama free-only compliances: 42 of 45 come from benchmark_traces (general adversarial prompts), 3 from jailbreak_archaeology. The free-tier vulnerability is not technique-specific.

Interpretation: Free-tier model endpoints on OpenRouter may use lower-precision quantization, different safety system prompts, or reduced guardrail layers compared to paid endpoints. The Llama 3.3-70B free-tier safety gap (45 free-only vs 12 paid-only compliances on 203 matched prompts) represents a meaningful deployment-level safety asymmetry. For the models that show this gap, users who cannot afford paid API access receive measurably less safe model behavior.

Limitation: This finding is observational. We cannot distinguish between quantization effects, system prompt differences, and rate-limit-induced safety bypass at the API level. Sample sizes for the model pairs tabulated above range from 46 to 203 matched prompts. The NVIDIA and OpenAI inverse patterns suggest the effect is not uniform across providers.

-- Reproducing query
-- paired: one row per result, tagged with its tier and its base model name
-- (the model name with the ':free' suffix removed).
WITH paired AS (
  SELECT REPLACE(m.name, ':free', '') AS base,
    CASE WHEN m.name LIKE '%:free' THEN 'free' ELSE 'paid' END AS tier,
    r.prompt_id,
    -- Prefer the LLM grade; fall back to the heuristic grade.
    COALESCE(r.llm_verdict, r.heuristic_verdict) AS verdict
  FROM results r
  JOIN evaluation_runs er ON r.evaluation_run_id = er.id
  JOIN models m ON er.model_id = m.id
  JOIN prompts p ON r.prompt_id = p.id
  JOIN source_datasets sd ON p.source_dataset_id = sd.id
  WHERE sd.name NOT LIKE '%obliteratus%'  -- exclude OBLITERATUS datasets
    -- Keep free-tier models plus paid models that have a free-tier sibling.
    AND (m.name LIKE '%:free' OR m.name IN (
      SELECT REPLACE(name, ':free', '') FROM models WHERE name LIKE '%:free'))
)
-- Self-join free and paid rows on (base, prompt_id). COUNT(*) counts row
-- combinations, so repeated runs of one prompt on either tier inflate matched_prompts.
SELECT p1.base,
  COUNT(*) AS matched_prompts,
  SUM(CASE WHEN p1.verdict='COMPLIANCE' AND p2.verdict!='COMPLIANCE' THEN 1 ELSE 0 END) AS free_only_comply,
  SUM(CASE WHEN p1.verdict!='COMPLIANCE' AND p2.verdict='COMPLIANCE' THEN 1 ELSE 0 END) AS paid_only_comply
FROM paired p1
JOIN paired p2 ON p1.base = p2.base AND p1.prompt_id = p2.prompt_id
WHERE p1.tier = 'free' AND p2.tier = 'paid'
GROUP BY p1.base
HAVING matched_prompts >= 15
ORDER BY free_only_comply DESC;
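
For readers who want to sanity-check the pairing logic without the database, the same matched-pair classification can be sketched in plain Python. This is an illustrative approximation, not the report's pipeline: the function names and the row layout are hypothetical, and unlike the SQL self-join it keeps only one verdict per tier and prompt (the last one seen), so repeated runs do not multiply the counts.

```python
from collections import defaultdict

def coalesce(llm_verdict, heuristic_verdict):
    """Mirror COALESCE(llm_verdict, heuristic_verdict): prefer the LLM grade."""
    return llm_verdict if llm_verdict is not None else heuristic_verdict

def discordant_counts(rows):
    """rows: iterable of (model_name, prompt_id, llm_verdict, heuristic_verdict)."""
    by_key = defaultdict(dict)  # (base, prompt_id) -> {tier: verdict}
    for name, prompt_id, llm, heur in rows:
        tier = "free" if name.endswith(":free") else "paid"
        base = name.replace(":free", "")
        by_key[(base, prompt_id)][tier] = coalesce(llm, heur)

    counts = defaultdict(lambda: {"matched": 0, "free_only": 0, "paid_only": 0})
    for (base, _prompt_id), tiers in by_key.items():
        if "free" not in tiers or "paid" not in tiers:
            continue  # prompt seen on one tier only: not a matched pair
        entry = counts[base]
        entry["matched"] += 1
        free_v, paid_v = tiers["free"], tiers["paid"]
        if free_v is None or paid_v is None:
            continue  # ungraded result: counted as matched, never as discordant
        if free_v == "COMPLIANCE" and paid_v != "COMPLIANCE":
            entry["free_only"] += 1
        elif paid_v == "COMPLIANCE" and free_v != "COMPLIANCE":
            entry["paid_only"] += 1
    return dict(counts)

# Toy rows for a made-up model "m/x"; not data from the corpus.
toy_rows = [
    ("m/x:free", 1, "COMPLIANCE", None),
    ("m/x", 1, None, "REFUSAL"),
    ("m/x:free", 2, "REFUSAL", "COMPLIANCE"),
    ("m/x", 2, "COMPLIANCE", None),
    ("m/x:free", 3, "REFUSAL", None),  # no paid counterpart, so unmatched
]
print(discordant_counts(toy_rows))
```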

Finding 2: Reasoning Overthink Inverted-U in DeepSeek R1

Claim: DeepSeek R1’s compliance rate follows an inverted-U pattern as thinking trace length increases. Moderate deliberation (500-1000 tokens) yields the highest ASR, while both very short and very long traces show lower ASR.

Evidence (DeepSeek R1 only, n=182 with thinking traces):

Thinking Bin	n	Strict ASR	Refusal Rate
short (<500 tokens)	50	36.0%	40.0%
medium (500-1000)	45	40.0%	33.3%
long (1001-2000)	48	31.3%	27.1%
very long (2000+)	39	28.2%	28.2%

Interpretation: This suggests two competing mechanisms:

Short traces: The model may not engage deeply enough with adversarial framing, defaulting to trained safety reflexes (high refusal rate 40.0%).
Medium traces (peak ASR): The model engages with the adversarial reasoning sufficiently to be manipulated, but not long enough to reason its way back to safety.
Long traces: Extended deliberation gives the model more opportunity to identify the adversarial intent and course-correct (refusal rate drops but so does compliance — suggesting more PARTIAL/HR verdicts).

This is consistent with Report #184 Finding 4 (thinking-token allocation inversion) but adds the non-monotonic shape. The finding is specific to DeepSeek R1; cross-model validation is needed.

Limitation: n=182 total, split across 4 bins of 39-50 each. The inverted-U pattern is suggestive but the per-bin sample sizes yield wide confidence intervals (approximately +/-14pp at 95% CI for the 40.0% peak).
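
The +/-14pp figure can be reproduced with a normal-approximation (Wald) interval for a proportion. The report does not state which interval it used, so the choice of method here is an assumption; the ASR values and bin sizes are copied from the table above.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% normal quantile

def wald_half_width(p: float, n: int, z: float = Z_95) -> float:
    """Half-width of the Wald interval for a proportion p observed on n trials."""
    return z * sqrt(p * (1 - p) / n)

# (strict ASR, n) per thinking bin, from the Finding 2 table.
bins = {
    "short (<500 tokens)": (0.360, 50),
    "medium (500-1000)": (0.400, 45),
    "long (1001-2000)": (0.313, 48),
    "very long (2000+)": (0.282, 39),
}
for label, (asr, n) in bins.items():
    half_width = wald_half_width(asr, n)
    print(f"{label}: {asr:.1%} +/- {half_width * 100:.1f}pp")
```

With half-widths of this size the intervals for all four bins overlap, which is why the pattern is described as suggestive.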

Finding 3: Safety Theater Verbosity Ratio

Claim: Some models generate significantly longer refusals than compliances, a phenomenon we term the “safety theater verbosity ratio.” From a cost perspective, this represents compute spent on safety justifications rather than on task output.

Evidence (Models with >= 5 compliances and >= 5 refusals, response_tokens > 0):

Model	Avg Refuse Tokens	Avg Comply Tokens	Theater Ratio	n_refuse	n_comply
mistralai/devstral-2512	1,487	777	1.91	19	11
gemma2:2b	410	229	1.79	31	6
Qwen/Qwen2.5-7B-Instruct	1,188	912	1.30	12	9
deepseek-r1:1.5b	1,181	939	1.26	62	88
nvidia/nemotron-3-super-120b-a12b:free	710	636	1.12	97	13

Contrast: Models with ratio < 1.0 (compliance longer than refusal):

Model	Theater Ratio	n_refuse	n_comply
deepseek/deepseek-r1-0528	0.74	59	62
stepfun/step-3.5-flash:free	0.64	76	8
meta-llama/llama-3.3-70b-instruct	0.51 (*)	93	60

(*) Llama’s compliances average 483 tokens while refusals average 245 — efficiently brief refusals.
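
The ratio in these tables is average refusal tokens divided by average compliance tokens. The sketch below recomputes it from the averages reported above; the function name `theater_ratio` and the 1.2 cut-off variable are illustrative choices, with the cut-off taken from the executive summary.

```python
def theater_ratio(avg_refuse_tokens: float, avg_comply_tokens: float) -> float:
    """Average refusal length relative to average compliance length."""
    return avg_refuse_tokens / avg_comply_tokens

# (avg refuse tokens, avg comply tokens), copied from the Finding 3 tables.
averages = {
    "mistralai/devstral-2512": (1487, 777),
    "gemma2:2b": (410, 229),
    "Qwen/Qwen2.5-7B-Instruct": (1188, 912),
    "deepseek-r1:1.5b": (1181, 939),
    "nvidia/nemotron-3-super-120b-a12b:free": (710, 636),
    "meta-llama/llama-3.3-70b-instruct": (245, 483),
}
THRESHOLD = 1.2  # cut-off used in the executive summary

ratios = {model: theater_ratio(*tokens) for model, tokens in averages.items()}
verbose_refusers = [model for model, ratio in ratios.items() if ratio > THRESHOLD]

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f}")
print("above threshold:", verbose_refusers)
```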

Interpretation: Models like Devstral-2512 spend nearly twice as many tokens explaining why they cannot comply as they would spend actually complying. This “verbose refusal” pattern has three implications: (1) it consumes unnecessary inference compute, (2) it may signal overtrained safety behavior that generates elaborate justifications, and (3) the verbosity itself may be a byproduct of instruction-following quality — the model is “helpfully” explaining its refusal in detail. This connects to Report #259’s FLIM Level 5 (safety theater) concept at the token generation level.

Limitation: Theater ratio is a descriptive metric, not a normative one. Verbose refusals may serve legitimate purposes (user education). Small compliance samples (n_comply < 10 for some models) make the ratio estimates noisy.

Finding 4: Reasoning-Era Attacks Bypass Legacy-Immune Providers

Claim: Some providers that show near-zero vulnerability to legacy attacks (cipher 2023, general AdvBench) are markedly more vulnerable to reasoning_2025 attacks. Their safety training appears to have patched specific attack families without generalizing to new exploit vectors. No dan_2022 cell meets the n >= 10 threshold, so the comparison rests on the other eras.

Evidence (Provider x Era, n >= 10 per cell):

Provider	dan_2022	cipher_2023	crescendo_2024	reasoning_2025	general
anthropic	-	5.3% (n=38)	5.4% (n=56)	0.0% (n=20)	-
google	-	0.0% (n=36)	1.7% (n=58)	18.2% (n=22)	4.6% (n=131)
meta-llama	-	-	-	45.0% (n=20)	0.0% (n=50)
openai	-	7.9% (n=38)	10.3% (n=58)	11.1% (n=18)	13.3% (n=60)
nvidia	-	-	-	-	18.2% (n=110)

Key Asymmetries:

Google: 0.0% ASR on cipher_2023 -> 18.2% on reasoning_2025. The Gemini 3 Flash preview model is specifically vulnerable to reasoning_exploit/cot_manipulation (4/4 = 100% ASR on that sub-technique, though n=4).
Meta-Llama: 0.0% ASR on general AdvBench -> 45.0% on reasoning_2025. The free tier is even more vulnerable at 60.0% (n=10).
Anthropic: The only provider showing 0.0% on reasoning_2025, with ASR at or below 5.4% in every era tested. This is consistent with Report #184’s finding that Claude Sonnet 4.5 achieves 0% on CoT-exploits.

Specific Reasoning Techniques (n >= 5):

Technique	n	Strict ASR	Broad ASR
cot_manipulation	29	41.4%	51.7%
thinking_trace	21	33.3%	33.3%
meta_reasoning	13	30.8%	30.8%
chain_injection	13	23.1%	23.1%
reward_hacking	13	15.4%	15.4%
deductive_trap	13	15.4%	23.1%
safety_introspection	15	13.3%	20.0%
self_contradiction	15	0.0%	0.0%

Interpretation: The reasoning_2025 era represents a qualitatively different attack surface. Legacy attacks (DAN, cipher, persona) operate at the prompt level — they manipulate what the model is told to do. Reasoning exploits operate at the inference level — they manipulate how the model thinks. Providers appear to have invested heavily in prompt-level defenses while the inference-level attack surface remains undertrained.

Limitation: Small per-cell samples (n=18-58) for era-stratified analysis. The reasoning_2025 prompts were specifically designed as novel attacks, which may inflate ASR relative to well-known legacy prompts that models have been trained to recognize.
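
To make the small-sample caveat concrete, a Wilson score interval can be computed per cell; unlike the Wald interval it stays informative when the observed ASR is 0%. The report does not compute these intervals, so this is an added sketch. The success counts are inferred from the reported percentages and n (for example 18.2% of 22 is 4), which is an assumption about the underlying data.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval (95% by default) for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# (successes, n); successes are back-calculated from the table, not queried.
cells = {
    "google cipher_2023": (0, 36),
    "google reasoning_2025": (4, 22),
    "meta-llama general": (0, 50),
    "meta-llama reasoning_2025": (9, 20),
}
for label, (successes, n) in cells.items():
    low, high = wilson_interval(successes, n)
    print(f"{label}: {successes}/{n} -> [{low:.1%}, {high:.1%}]")
```

Under these assumptions the two Google intervals overlap slightly, while the two Meta-Llama intervals do not, so the Meta-Llama contrast is the better supported of the two.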

Finding 5: Response Duration as a Safety Predictor

Claim: Response latency correlates positively with compliance rate. The fastest responses are the least likely to be compliances; the slowest are the most likely.

Evidence (n=6,800 with duration_ms > 0, non-OBLITERATUS):

Duration Bin	n	Strict ASR	Broad ASR	Refusal Rate
< 2 seconds	938	5.1%	6.8%	20.0%
2-5 seconds	672	6.5%	10.3%	26.3%
5-15 seconds	1,184	16.6%	25.0%	45.5%
15-30 seconds	1,785	16.1%	24.6%	24.3%
30-60 seconds	1,431	15.2%	26.6%	24.3%
60+ seconds	790	25.8%	36.2%	17.8%

Key Observation: The jump from < 2s (5.1% ASR) to 60s+ (25.8% ASR) is a 5.1x increase. The pattern is approximately monotonic with one plateau in the 5-60s range.
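
A minimal sketch of the binning behind the table follows. The bin edges come from the table; which bin an exact boundary value falls into and the use of COMPLIANCE verdicts as the strict ASR numerator are assumptions, since the report does not spell them out. The function names are hypothetical.

```python
from bisect import bisect_right

EDGES_MS = [2_000, 5_000, 15_000, 30_000, 60_000]
LABELS = [
    "< 2 seconds", "2-5 seconds", "5-15 seconds",
    "15-30 seconds", "30-60 seconds", "60+ seconds",
]

def duration_bin(duration_ms: float) -> str:
    """Map a latency in milliseconds to its bin (boundaries go to the upper bin)."""
    return LABELS[bisect_right(EDGES_MS, duration_ms)]

def strict_asr_by_bin(rows):
    """rows: iterable of (duration_ms, verdict). Returns {bin label: strict ASR}."""
    totals, hits = {}, {}
    for duration_ms, verdict in rows:
        if duration_ms <= 0:
            continue  # mirrors the duration_ms > 0 filter in the evidence header
        label = duration_bin(duration_ms)
        totals[label] = totals.get(label, 0) + 1
        # Assumption: strict ASR counts COMPLIANCE verdicts only.
        hits[label] = hits.get(label, 0) + (verdict == "COMPLIANCE")
    return {label: hits[label] / totals[label] for label in totals}

# Ratio quoted in the key observation: slowest bin vs fastest bin.
print(f"{25.8 / 5.1:.1f}x")
```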

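Interpretation: This is primarily a confound rather than a causal mechanism: compliant responses generate more tokens (avg 1,294 tokens for COMPLIANCE vs 807 for REFUSAL, per verdict-verbosity analysis), which takes longer. Refusals are shorter on average and therefore tend to be faster, although the refusal rate column does not peak in the fastest bin (20.0% under 2 seconds vs 45.5% at 5-15 seconds), so refusals alone do not account for the low ASR of fast responses. However, the finding has practical implications: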

Inference-time safety monitoring: An anomalously long generation time could serve as a soft signal for compliance review.
Timeout-based safety: Very short timeouts might inadvertently improve safety by truncating compliant generations before harmful content is fully produced.

Limitation: Duration is confounded with response length, model speed, and server load. The relationship is not causal.

Finding 6: Response-to-Thinking Ratio Predicts Compliance Direction

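Claim: Among reasoning models, the ratio of response tokens to thinking tokens tracks refusal rate closely: refusal rate falls steadily as the ratio rises. Its relationship to strict ASR is weaker and non-monotonic, with the highest ASR in the >5x bin.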

Evidence (n=1,224 results with both response_tokens > 0 and thinking_tokens > 0, non-OBLITERATUS):

Ratio Bin (resp/think)	n	Strict ASR	Refusal Rate	Avg Resp Tokens	Avg Think Tokens
0.5-1.0x (balanced)	170	23.5%	50.0%	1,201	1,358
1-2x (resp dominant)	659	20.0%	44.8%	1,339	1,019
2-5x (resp heavy)	264	17.8%	27.3%	1,701	592
>5x (minimal thinking)	131	29.8%	13.0%	3,834	418

Key Observations:

When thinking roughly equals response length (0.5-1x), refusal rate is highest (50.0%). The model is “agonizing” over the decision and more often refusing.
When response tokens are 5x+ thinking tokens, the model produces very long outputs (avg 3,834 tokens) with minimal deliberation (avg 418 tokens). This yields the highest ASR (29.8%) and lowest refusal rate (13.0%).
The minimal-thinking / long-response pattern suggests the model has “decided” quickly (possibly bypassed safety reasoning) and is generating at length.

Interpretation: This extends Report #184 Finding 4 (thinking-token allocation inversion). The ratio metric may be more diagnostic than absolute thinking tokens because it normalizes for the overall complexity of the task. A response-to-thinking ratio above 5.0 appears to be a strong signal that safety reasoning has been abbreviated relative to output generation — a potential flag for automated safety monitoring.
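
As a rough illustration of how such a flag might work, the sketch below computes the ratio, assigns it to the bins used in the table, and marks results above 5.0 for review. It is not an existing component of the pipeline: the function names are hypothetical, the handling of exact bin boundaries is an assumption, and the "<0.5x" label is added only because the table does not report that range.

```python
from typing import Optional

def resp_think_ratio(response_tokens: int, thinking_tokens: int) -> Optional[float]:
    """Response-to-thinking ratio, or None when either count is missing or zero."""
    if response_tokens <= 0 or thinking_tokens <= 0:
        return None
    return response_tokens / thinking_tokens

def ratio_bin(ratio: float) -> str:
    """Assign a ratio to the bins of the Finding 6 table."""
    if ratio > 5.0:
        return ">5x (minimal thinking)"
    if ratio > 2.0:
        return "2-5x (resp heavy)"
    if ratio > 1.0:
        return "1-2x (resp dominant)"
    if ratio >= 0.5:
        return "0.5-1.0x (balanced)"
    return "<0.5x (not reported)"

def needs_review(response_tokens: int, thinking_tokens: int, threshold: float = 5.0) -> bool:
    """Soft flag: True when output length far exceeds deliberation length."""
    ratio = resp_think_ratio(response_tokens, thinking_tokens)
    return ratio is not None and ratio > threshold

# Bin averages from the table: (avg response tokens, avg thinking tokens).
for resp, think in [(1201, 1358), (1339, 1019), (1701, 592), (3834, 418)]:
    ratio = resp_think_ratio(resp, think)
    print(f"{ratio:.2f} -> {ratio_bin(ratio)}, review={needs_review(resp, think)}")
```

Because the ratio is model-dependent, a per-model threshold would likely be more appropriate than a single global value.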

Limitation: The ratio is model-dependent (some models always produce short thinking traces). Cross-model pooling may mask model-specific patterns.

Cross-Cutting Theme: The Inference-Time Safety Gap

Findings 2, 4, 5, and 6 collectively point to a consistent pattern: safety is not uniformly applied at inference time. Models show variable safety behavior depending on:

How long they think (Finding 2: inverted-U)
What attack era they face (Finding 4: reasoning attacks bypass legacy defenses)
How long the response takes to generate (Finding 5: duration correlation)
How much thinking precedes the response (Finding 6: ratio as predictor)

This suggests that safety training creates a set of heuristic checkpoints rather than a deep semantic understanding of harmful intent. When the attack vector operates at a different level (reasoning vs. prompt, fast vs. slow, brief thinking vs. extended), the checkpoints fail.

Recommended Follow-Up Experiments

Free-tier safety audit (Finding 1): Run identical prompt sets through free and paid endpoints for 10+ model pairs with n >= 100 per pair. Isolate whether the gap is quantization, system-prompt, or guardrail-related.
Reasoning overthink validation (Finding 2): Replicate the inverted-U pattern on Qwen3 and Nemotron reasoning models. If confirmed across 3+ models, this becomes a robust empirical finding.
Safety theater cost estimation (Finding 3): Calculate the aggregate compute cost of verbose refusals across the corpus. Estimate what fraction of inference spend goes to safety justifications.
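Reasoning-era attack expansion (Finding 4): Expand the reasoning_2025 prompt set from 164 to 500+ results to narrow per-cell confidence intervals. Under a normal approximation, reaching +/-5pp at 95% confidence takes about 246 results per cell at 20% ASR and 385 at 50% ASR, so the target should be set per provider cell rather than for the set as a whole.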
Resp/think ratio monitor prototype (Finding 6): Build a lightweight classifier that flags results with resp/think > 5.0 for human review. Test whether this improves FLIP grading efficiency.
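
When sizing these follow-ups, the per-cell sample needed for a target margin can be estimated with the standard normal-approximation formula n = z^2 p(1-p) / w^2. This is a planning sketch, not part of the report's methodology; the function name `required_n` is hypothetical and the ASR values passed in are illustrative.

```python
from math import ceil

def required_n(p: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n giving a Wald interval of +/- half_width around proportion p."""
    return ceil(z * z * p * (1 - p) / half_width ** 2)

# Per-cell sample sizes for a +/-5pp margin at a few plausible ASR levels.
for asr in (0.10, 0.20, 0.45, 0.50):
    print(f"ASR {asr:.0%}: n >= {required_n(asr, 0.05)} per cell")
```

The requirement applies to each provider and era cell separately, so the total number of results needed scales with the number of cells being compared.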

Methodology

Database: database/jailbreak_corpus.db (schema version 13)
Verdict source: COALESCE(llm_verdict, heuristic_verdict) throughout
Corpus filter: All analyses exclude OBLITERATUS datasets (sd.name NOT LIKE '%obliteratus%')
Statistical note: Per-cell sample sizes are noted for all findings. Confidence intervals are wide for cells with n < 50; findings should be treated as preliminary patterns requiring validation, not definitive conclusions.
Grading methodology: Mixed (LLM-graded where available, heuristic fallback). See CANONICAL_METRICS.md for grading quality notes.

Report generated by Clara Oswald, Principal Research Analyst, F41LUR3-F1R57 Embodied AI Safety Research