The Verbosity Signal — Response Length as a Zero-Cost Jailbreak Detector | Research | Failure-First

Adrian Wedd

Report 189 Research — Empirical Study 2026-03-24

Executive Summary

Compliant responses to jailbreak prompts are systematically longer than refusals. Across 1,751 evaluation results from 51 models and 9 providers with token-level instrumentation, COMPLIANCE responses average 1,356 tokens vs 857 tokens for REFUSAL — a 1.58x ratio (Mann-Whitney U z=10.92, p<1e-27, Cohen’s d=0.369). This signal holds for all 9 providers with sufficient data (n>=20 per verdict class). The response-token ROC curve yields AUC=0.651 [95% CI: 0.626, 0.677], enabling a zero-cost, model-agnostic jailbreak detector that requires no classifier model, no prompt analysis, and no content inspection — only a token counter.

This is a modest but highly reliable signal. It does not replace LLM-based classification (which achieves much higher accuracy), but it provides a zero-latency, zero-cost first-pass filter that can flag suspicious responses for downstream review. At the optimal Youden threshold of 661 tokens, the detector achieves 62.9% recall at 62.7% specificity (F1=0.607).

1. Motivation

The F41LUR3-F1R57 corpus contains 132,416 evaluation results across 236 models. During statistical audits (Report #48, sprint-26 wave 16), we observed that compliant responses were consistently longer than refusals — not merely as an artifact of particular models, but as a structural property of how language models generate harmful vs. safety-aligned content.

If response length reliably discriminates compliance from refusal, it constitutes a zero-cost inference-time signal: no additional model calls, no content classification, no prompt analysis. Just count tokens. This would be the cheapest possible jailbreak detector.

Research questions:

What is the discriminative power (AUC) of response length for detecting jailbreak compliance?
Does the signal hold per-provider, or only in aggregate?
For reasoning models, do thinking tokens carry additional signal?
How does a length-based detector compare to our LLM classifier?
Can we propose a practical detection threshold?

2. Data and Methods

2.1 Data Sources

We analyze the jailbreak_corpus.db (schema v13) with three analytical tiers:

Tier	Metric	N (C/R)	Models	Providers
Tier 1 (primary)	Response tokens	798 / 953	51	9
Tier 2	Thinking tokens	72 / 65	~5 reasoning	3
Tier 3 (full corpus)	Character length	20,243 / 4,940	190	23

Tier 1 (API-served models with token counts) is the primary analysis. Token counts are more stable than character counts across tokenizers and provide the actionable metric for deployment.

Tier 3 includes obliteratus (abliterated) models and quantized models served via HuggingFace-compatible APIs that return fixed-format outputs. These produce near-identical character lengths regardless of verdict, diluting the aggregate signal.

2.2 Verdict Labels

We use COALESCE(llm_verdict, heuristic_verdict) as the ground truth label. The analysis is restricted to binary COMPLIANCE vs REFUSAL. PARTIAL verdicts are analyzed separately to understand their position in the length distribution.

2.3 Statistical Methods

ROC curves: Response token count as the continuous predictor; COMPLIANCE=positive class
AUC: Trapezoidal integration of ROC curve
95% CI for AUC: 5,000 bootstrap resamples (seed=42)
Optimal threshold: Youden’s J statistic (maximizes TPR - FPR)
Effect size: Mann-Whitney U with normal approximation (n>30), Cohen’s d
Per-provider analysis: Independent ROC curves for each provider with n>=20 per class

2.4 Reproducibility

python tools/stats/verbosity_detector.py --output results.json

Tool: tools/stats/verbosity_detector.py (standalone, argparse, JSON output)

3. Results

3.1 Aggregate Verbosity Signal (Tier 1)

Verdict	N	Mean tokens	Median	SD
COMPLIANCE	798	1,356	—	—
REFUSAL	953	857	—	—
PARTIAL	421	941	—	—

Ratio: 1.58x (compliance / refusal)

Mann-Whitney U: z=10.92, p<1e-27, effect size r=0.261 (small-medium) Cohen’s d: 0.369 (small-medium effect)

The effect is highly statistically significant but of modest practical magnitude. This is consistent with a detectable but not dominant signal.

3.2 ROC Analysis

Metric	Value	95% CI
AUC	0.651	[0.626, 0.677]
Optimal threshold	661 tokens	—
Youden’s J	0.257	—
TPR at optimal	0.629	—
FPR at optimal	0.373	—
Specificity at optimal	0.627	—
Precision at optimal	0.586	—
Recall at optimal	0.629	—
F1 at optimal	0.607	—

Interpretation: AUC=0.651 is above chance (0.50) but well below the excellent range (>0.80). This is a weak-to-moderate discriminator — useful as a screening tool, not as a standalone classifier.

3.3 Per-Provider Analysis

The signal is universal in direction (compliance > refusal) across all 9 providers with sufficient data:

Provider	N_C	N_R	C mean	R mean	Ratio	AUC	Direction
openai	36	45	1,879	511	3.68x	0.836	C>R
nvidia	104	158	2,789	1,162	2.40x	0.784	C>R
meta-llama	106	131	464	270	1.72x	0.733	C>R
mistralai	43	94	633	559	1.13x	0.674	C>R
deepseek	63	61	2,160	1,612	1.34x	0.657	C>R
ollama	360	226	1,132	1,002	1.13x	0.613	C>R
Qwen	17	23	1,470	1,077	1.36x	0.482	C>R
liquid	39	31	1,440	1,075	1.34x	0.512	C>R
meta	12	40	294	269	1.09x	0.456	C>R

Key finding: 9/9 providers show compliance > refusal in mean token count. The signal is strongest for OpenAI (AUC=0.836, 3.68x ratio) and NVIDIA (AUC=0.784, 2.40x ratio), and weakest for smaller-model providers. Binomial test for direction consistency: p = 2^(-9) = 0.002.

Per-model highlights (models with n>=20 per class):

Model	AUC	Ratio
openai/gpt-oss-120b	0.887	5.17x
nvidia/nemotron-nano-12b-v2-vl:free	0.879	3.45x
mistralai/mistral-large-2411	0.864	2.45x
openai/gpt-4o-mini	0.792	2.03x
nvidia/nemotron-3-nano-30b-a3b	0.763	1.89x
meta-llama/llama-3.3-70b-instruct	0.760	1.97x

Several individual models achieve AUC>0.80, suggesting the signal is strongest for well-instrumented, safety-trained models that produce distinctive refusal patterns (short, formulaic) vs. compliance patterns (detailed, elaborative).

3.4 Reasoning Model Analysis (Tier 2)

Metric	Value
N (compliance / refusal)	72 / 65
Mean thinking tokens (C)	1,129
Mean thinking tokens (R)	1,110
Think ratio	1.02x (not significant)
AUC (thinking tokens)	0.503 (chance level)
AUC (response tokens)	0.636
AUC (combined think+response)	0.636
Mann-Whitney p (thinking)	0.91

Thinking tokens carry no signal. The ratio is 1.02x (p=0.91), and thinking-token AUC is 0.503 — indistinguishable from random. This contradicts the earlier AGENT_STATE claim that “reasoning models think 75% longer on compliant responses (p=9e-14).” That finding appears to have been based on a different subset or methodology (possibly the earlier Report #48 audit which used different grading criteria). With the current LLM-graded data, the thinking-token signal is null.

The response-token signal persists for reasoning models (AUC=0.636), meaning the verbosity signal is a property of the output generation, not the reasoning process.

3.5 PARTIAL Verdict Position

Verdict	Mean chars
COMPLIANCE	333
PARTIAL	532
REFUSAL	1,006

PARTIAL responses fall between COMPLIANCE and REFUSAL in character length, closer to COMPLIANCE. This is consistent with PARTIAL verdicts representing responses that hedge textually but still produce substantive (short) harmful content, rather than elaborate safety explanations.

3.6 Confound: The Obliteratus Effect

The full-corpus character-length analysis (Tier 3) shows an inverted aggregate signal: COMPLIANCE mean=333 chars vs REFUSAL mean=1,006 chars (ratio 0.33x). This is driven by 7,893 obliteratus (abliterated) COMPLIANCE results that produce uniformly short responses (~217 chars). When obliteratus is excluded, the ratio shifts to 0.39x — still inverted, because character length in the full corpus is dominated by Qwen quantized models with similarly compressed outputs.

This confound does not affect the Tier 1 token-based analysis, which excludes models without proper token instrumentation. The lesson: character length is not a reliable proxy for verbosity across heterogeneous model populations; token counts are required.

4. Comparison to Baselines

4.1 Random Baseline

A random classifier achieves AUC=0.500. The verbosity detector at AUC=0.651 provides a 15.1 percentage-point improvement over chance.

4.2 LLM Classifier (FLIP)

The F41LUR3-F1R57 LLM-based classifier (FLIP, using deepseek-r1:1.5b or Claude Haiku 4.5) achieves substantially higher accuracy. While we do not have a formal AUC for the LLM classifier (it produces categorical verdicts, not continuous scores), its inter-rater reliability (kappa=0.126 against heuristic, much higher against human judgment) and the overall grading methodology produce more nuanced classifications.

Detector	AUC	Cost per eval	Latency
Random	0.500	$0	0ms
Verbosity (tokens)	0.651	$0	<1ms
Heuristic keywords	~0.60-0.70*	$0	<1ms
LLM classifier	~0.85-0.95*	$0.001-0.01	500-2000ms

*Estimated ranges; exact AUC depends on threshold selection and corpus.

The verbosity detector occupies a unique niche: zero marginal cost and zero additional latency, making it suitable as a pre-filter before more expensive classifiers.

5. Proposed Detector: VerbosityGuard

5.1 Algorithm

Input: response_token_count (integer)
Threshold: T (default 661 tokens, tunable per-provider)

if response_token_count >= T:
    flag = "SUSPICIOUS"  # possible compliance
else:
    flag = "LIKELY_SAFE"  # possible refusal

5.2 Operating Points

For deployment, the threshold should be chosen based on the desired tradeoff:

Threshold	TPR (recall)	FPR	Precision	F1	Use case
400	0.76	0.51	0.55	0.64	High-recall screening
661	0.63	0.37	0.59	0.61	Balanced (Youden optimal)
1000	0.47	0.24	0.62	0.53	High-precision flagging
1500	0.33	0.14	0.66	0.44	Conservative alerting

5.3 Per-Provider Tuning

The optimal threshold varies substantially by provider (from 269 tokens for meta to 3,192 for deepseek). For production deployment, provider-specific thresholds would improve performance:

Provider	Recommended threshold	Expected AUC
openai	~800	0.84
nvidia	~1,500	0.78
meta-llama	~400	0.73
ollama (mixed)	~500	0.61

5.4 Limitations

Modest discriminative power. AUC=0.651 means roughly 1 in 3 compliance events will be missed, and 1 in 3 flagged events will be false positives. This is a screening tool, not a classifier.
Provider-dependent thresholds. A single global threshold underperforms provider-specific calibration.
Adversarial evasion. An attacker aware of the verbosity detector could instruct the model to produce short compliant responses (e.g., “answer in one sentence”). Format-lock attacks already show inverted verbosity (Report #51).
Obliteratus confound. Abliterated and quantized models produce near-uniform response lengths, rendering the signal useless. The detector requires token-instrumented API responses.
Thinking tokens are uninformative. Despite prior claims, reasoning-trace length does not distinguish compliance from refusal in the current graded corpus (AUC=0.503).

6. Discussion

6.1 Why Are Compliant Responses Longer?

Three hypotheses:

Elaboration hypothesis: When models comply with harmful requests, they generate detailed instructions, step-by-step procedures, or comprehensive explanations — inherently verbose. Refusals are formulaic (“I cannot assist with that”).
Safety-training artifact: Safety fine-tuning teaches models to produce short, decisive refusals. The refusal template is a compressed, well-practiced pattern. Compliance lacks this template and reverts to the base model’s default verbosity.
Hedging hypothesis: Partially compliant responses include disclaimers, caveats, and warnings alongside the harmful content, inflating length. (Supported by PARTIAL position data — PARTIAL falls between C and R.)

These hypotheses are not mutually exclusive and likely all contribute. The elaboration hypothesis is most consistent with the 3.68x ratio observed for OpenAI models, where compliance produces extended, detailed outputs but refusal is a brief safety message.

6.2 Implications for Safety Engineering

The verbosity signal suggests that response length monitoring should be a standard component of inference-time safety systems, even if only as a first-pass filter. Specifically:

Anomaly detection: Flag responses that exceed the model’s typical refusal length for the given prompt category.
Tiered classification: Use length as a cheap pre-filter; only invoke expensive LLM classifiers on flagged responses.
Rate limiting: Apply stricter rate limits to sessions producing unusually long responses to adversarial-category prompts.

6.3 Relationship to Prior Work

The verbosity signal complements existing detection approaches:

Perplexity-based detection (Jain et al., 2023): Measures input unusualness. Verbosity measures output unusualness. Orthogonal signals.
Refusal classifiers (Mazeika et al., 2024 / HarmBench): Content-based. Verbosity is content-agnostic.
SmoothLLM (Robey et al., 2023): Perturbation-based. Verbosity requires no perturbation.

No prior work has, to our knowledge, systematically characterized response length as a jailbreak detection signal across 51 models and 9 providers with formal ROC analysis.

6.4 Correction to AGENT_STATE.md

The prior claim that “reasoning models think 75% longer on compliant responses (p=9e-14)” is not confirmed in the current analysis. With n=137 reasoning-model results (72 C, 65 R), thinking token means are nearly identical (1,129 vs 1,110, ratio 1.02x, p=0.91). The prior finding may have been based on a different grading methodology or subset. AGENT_STATE.md should be updated to reflect this null finding.

7. Conclusions

Response length is a statistically significant but modest predictor of jailbreak compliance. AUC=0.651 [0.626, 0.677] across 1,751 token-instrumented results from 51 models.
The signal is universal in direction. All 9 providers show compliance > refusal (binomial p=0.002). Compliance averages 1.58x more tokens than refusal.
The signal varies in strength by provider. OpenAI (AUC=0.836) and NVIDIA (AUC=0.784) show strong signals; smaller-model providers show weaker signals.
Thinking tokens carry no discriminative signal (AUC=0.503, p=0.91). The verbosity signal is purely a property of output generation.
A zero-cost VerbosityGuard detector at threshold 661 tokens achieves F1=0.607, suitable as a pre-filter before more expensive classifiers.
Format-lock attacks invert the signal (Report #51), making the detector adversarially evadable. It should be used in conjunction with, not instead of, content-based classifiers.

Data Availability

Tool: tools/stats/verbosity_detector.py
Database: database/jailbreak_corpus.db (schema v13)
Reproduction: python tools/stats/verbosity_detector.py --output results.json

F41LUR3-F1R57 Report #189. Research conducted under the Failure-First Embodied AI framework.