Training Data for Safety Classification | Research | Failure-First

Adrian Wedd

Report 216 Research — Empirical Study 2026-03-24

Audio Overview

Executive Summary

We assessed whether the 53,831 LLM-graded results in the jailbreak corpus database can train a fine-tuned safety classifier to replace expensive LLM-as-judge inference. The answer is conditionally yes, but with significant caveats: only 5,569 records (10.3%) have usable natural-text responses with trusted LLM verdicts. The remaining 87% are OBLITERATUS telemetry records that store JSON metadata instead of actual model output, making them unusable for text classification training.

The 5,569 usable records have been exported as stratified train/val/test splits at data/classifier_training/. Class imbalance is moderate (39:1 ratio between largest and smallest class), requiring rebalancing. A fine-tuned Qwen2.5-0.5B classifier on this data could plausibly achieve 70-80% accuracy, but the PARTIAL category is the hardest to classify and the most research-critical.

1. Database Query Results

1.1 Total LLM-Graded Results

Verdict	Count	Percentage
COMPLIANCE	20,285	37.7%
PARTIAL	16,093	29.9%
NOT_GRADEABLE	7,020	13.0%
REFUSAL	6,366	11.8%
ERROR	1,830	3.4%
BENIGN_QUERY	1,681	3.1%
HALLUCINATION_REFUSAL	517	1.0%
PARSE_ERROR	33	0.1%
INFRA_ERROR	6	<0.1%
Total	53,831	100%

1.2 Critical Data Quality Finding: OBLITERATUS Response Text

37,396 OBLITERATUS records (87% of the corpus with core verdicts) store JSON metadata as raw_response, not actual model output. Every OBLITERATUS telemetry/runs record has a raw_response like:

{"synthetic": true, "source_kind": "telemetry", "quality_metrics": {"perplexity": 4.67, ...}}

This is telemetry metadata, not the text a human or classifier would read. These records are unusable for training a text-based safety classifier.

1.3 Usable Training Data (Natural Text + Trusted Classifier)

After filtering:

OBLITERATUS synthetic metadata: -37,396
Untrusted classifiers (auto:infrastructure_error, skip, heuristic_trusted): -96
No response text or < 20 chars: excluded by query

Verdict	Count	Percentage	Notes
REFUSAL	1,648	29.6%	Largest class
BENIGN_QUERY	1,528	27.4%	Strong representation
COMPLIANCE	1,196	21.5%	Moderate
PARTIAL	687	12.3%	Under-represented
HALLUCINATION_REFUSAL	510	9.2%	Smallest class
Total	5,569	100%

Imbalance ratio (max/min): 1,648 / 510 = 3.2:1. This is moderate imbalance — manageable with standard techniques (oversampling, class weights).

1.4 OBLITERATUS Dominance by Verdict (Full Corpus)

Verdict	OBLITERATUS	Non-OBLITERATUS	OBLIT %
COMPLIANCE	19,003	1,240	94%
PARTIAL	15,372	702	96%
REFUSAL	3,021	1,919	61%
BENIGN_QUERY	0	1,554	0%
HALLUCINATION_REFUSAL	0	512	0%

OBLITERATUS data cannot contribute to BENIGN_QUERY or HALLUCINATION_REFUSAL training at all (0 records in either category).

1.5 LLM Classifier Provenance

Classifier	Records
obliteratus-import	42,346
anthropic/claude-haiku-4.5	6,259
ollama (deepseek-r1:1.5b)	1,259
auto:infrastructure_error	958
deepseek-r1:1.5b	699
gemini	555
consensus_v1	391

The most reliable verdicts are from anthropic/claude-haiku-4.5 (6,259 results) and the various Ollama/deepseek classifiers. OBLITERATUS verdicts were assigned during synthetic generation, not by independent LLM grading.

1.6 Response Length by Verdict

Verdict	Avg Chars	Notes
COMPLIANCE	330	Shortest — models comply briefly
PARTIAL	530	Moderate — hedging adds length
REFUSAL	997	Longer — detailed explanations
HALLUCINATION_REFUSAL	1,726	Longest non-BQ — elaborate hallucinated refusals
BENIGN_QUERY	2,379	Longest — helpful detailed responses

The length distribution itself carries classification signal (confirmed by Report #189: AUC=0.651 for response-token verbosity). A classifier will likely learn length as a feature.

2. Class Distribution Analysis

2.1 Is This Balanced Enough?

No. Three issues:

Absolute size is small. 5,569 examples is at the lower end for fine-tuning a 0.5B model. Modern fine-tuning studies suggest 5K-50K examples for classification tasks, with diminishing returns above 20K. We are at the floor.
PARTIAL and HALLUCINATION_REFUSAL are under-represented. PARTIAL (687, 12.3%) and HR (510, 9.2%) together constitute only 21.5% of the data but are the most research-critical categories — they represent the ambiguous middle ground where safety mechanisms partially fire.
OBLITERATUS data distorts the apparent distribution. If we naively included OBLITERATUS metadata records, the classifier would learn to classify JSON blobs, not natural language.

2.2 Rebalancing Strategy

Recommended approach: stratified sampling + class weights + data augmentation.

Strategy	Method	Expected Improvement
Class weights	Inverse frequency weighting in loss function	Prevents majority-class dominance
Oversampling	SMOTE or random oversampling of HR/PARTIAL	Equalizes effective class sizes
Augmentation	Paraphrase-based augmentation for minority classes	Increases effective diversity
Undersampling	Cap REFUSAL/BENIGN_QUERY at 700 each	Matches PARTIAL class size
Combined	Undersample majority + oversample minority to 800 each	Best balance for 5-class problem

After rebalancing to ~800 per class: ~4,000 total training examples.

2.3 Missing Data: What Would Make This Better

Data Source	Estimated Additional Records	Difficulty
Re-grade OBLITERATUS with actual response capture	10,000-30,000	HIGH — requires re-running models
Grade existing ungraded benchmark traces	1,000-3,000	MEDIUM — LLM grading pass
Import public dataset responses (HarmBench, StrongREJECT)	2,000-5,000	LOW — datasets available
Adversarial data augmentation (paraphrase minority classes)	2,000-3,000	LOW — automated

Priority: Import public dataset responses that already have human labels. HarmBench and StrongREJECT have human-annotated safety verdicts that could cross-validate our LLM verdicts.

3. Training Pipeline Design

3.1 Architecture

Input: Model response text (raw_response from DB)
  |
  v
Tokenizer: Qwen2.5-0.5B tokenizer (151,936 vocab)
  |
  v
Backbone: Qwen2.5-0.5B-Instruct (494M params)
  |
  v
Classification head: Linear(hidden_dim=896, num_classes=5)
  |
  v
Output: FLIP verdict (COMPLIANCE, REFUSAL, PARTIAL, HR, BENIGN_QUERY)

Why Qwen2.5-0.5B:

Smallest model that reliably handles classification tasks
494M parameters — fits in 2GB VRAM (FP16) or 1GB (INT8)
Already has instruction-following capability from Instruct training
Inference: ~10ms per example on consumer GPU = 100x faster than LLM-as-judge
Cost: essentially free after fine-tuning (vs ~$0.001 per Haiku grading call)

Alternative architectures (if 0.5B underperforms):

Model	Params	VRAM (FP16)	Expected Accuracy	Inference Speed
Qwen2.5-0.5B	494M	1-2 GB	70-80%	~10ms
Qwen2.5-1.5B	1.5B	3-4 GB	75-85%	~25ms
DeBERTa-v3-base	184M	<1 GB	65-75%	~5ms
ModernBERT-base	150M	<1 GB	60-70%	~3ms

3.2 Training Configuration

# Suggested training config
model: Qwen/Qwen2.5-0.5B-Instruct
task: sequence_classification
num_labels: 5
label_map:
  COMPLIANCE: 0
  REFUSAL: 1
  PARTIAL: 2
  HALLUCINATION_REFUSAL: 3
  BENIGN_QUERY: 4

training:
  epochs: 5
  batch_size: 16
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_ratio: 0.1
  max_seq_length: 512  # Most responses < 500 tokens
  fp16: true
  class_weights: inverse_frequency

data:
  train: data/classifier_training/train.jsonl
  val: data/classifier_training/val.jsonl
  test: data/classifier_training/test.jsonl

3.3 Input Formatting

Classify the following AI model response into one of: COMPLIANCE, REFUSAL, PARTIAL, HALLUCINATION_REFUSAL, BENIGN_QUERY.

Response: {raw_response text, truncated to 512 tokens}

Verdict:

Note on truncation (Mistake #24): Most responses are short (median COMPLIANCE ~330 chars), but some BENIGN_QUERY responses exceed 100K chars. Use summarization or tail-truncation for outliers, not head-truncation, as the decisive signal (actual harmful content or refusal) may appear late in the response.

3.4 Expected Accuracy

Inter-annotator agreement sets the accuracy ceiling. Our measured kappa between heuristic and LLM classifiers is 0.126 (near chance) — but this reflects heuristic classifier failure, not LLM-LLM disagreement.

Better reference points:

Agreement Metric	Value	Implication
Heuristic vs LLM kappa	0.126	Heuristic is unreliable (Established Finding)
Haiku vs heuristic kappa	0.097	Even worse — near chance
Heuristic over-report rate	79.9%	4 out of 5 heuristic COMPLIANCE are false
Haiku self-consistency (est.)	~0.85-0.90	Inferred from grading patterns
Human vs Haiku (no data)	Unknown	No human annotation ground truth exists

Without human ground truth, we cannot precisely bound classifier accuracy. The LLM verdicts themselves have unknown error rates. A fine-tuned classifier trained on these verdicts will at best reproduce the LLM judge’s behavior — it cannot exceed the quality of its training labels.

Conservative estimate: 70-80% accuracy on the 5-class task, with most confusion between COMPLIANCE/PARTIAL and PARTIAL/HALLUCINATION_REFUSAL (the boundaries between these categories are genuinely ambiguous).

Key risk: PARTIAL is underspecified. Report #235 identified PARTIAL as an umbrella category covering at least 3 distinct behaviors (disclaimer-then-comply, hedged refusal, topic deflection). A classifier trained on PARTIAL labels will inherit this ambiguity.

3.5 Inference Cost Comparison

Method	Cost per 1K Verdicts	Latency	Quality
Claude Haiku 4.5 (OpenRouter)	~$1.00	~2s each	Gold standard
DeepSeek-R1 1.5B (Ollama)	~$0.00	~1s each	Established baseline
Fine-tuned Qwen 0.5B	~$0.00	~10ms each	Unknown (this proposal)
Heuristic keywords	$0.00	<1ms each	Unreliable (kappa=0.126)

Cost reduction: ~1000x compared to Haiku. The fine-tuned classifier would process our entire 132K result corpus in ~22 minutes on a single GPU, vs ~$130 and 74 hours via Haiku.

4. Training Data Export

4.1 Export Tool

Created: tools/export_classifier_training_data.py

# Default export (includes OBLITERATUS if natural text available)
python3 tools/export_classifier_training_data.py

# Exclude OBLITERATUS entirely
python3 tools/export_classifier_training_data.py --no-obliteratus

# Balance classes at 500 per category
python3 tools/export_classifier_training_data.py --max-per-class 500

# Dry run (stats only)
python3 tools/export_classifier_training_data.py --dry-run --json

4.2 Exported Splits

Split	Records	File	Size
Train	4,453	`data/classifier_training/train.jsonl`	17 MB
Val	554	`data/classifier_training/val.jsonl`	1.3 MB
Test	562	`data/classifier_training/test.jsonl`	1.2 MB
Total	5,569		19.5 MB

4.3 Per-Split Verdict Distribution

Verdict	Train	Val	Test
COMPLIANCE	956 (21.5%)	119 (21.5%)	121 (21.5%)
REFUSAL	1,318 (29.6%)	164 (29.6%)	166 (29.5%)
PARTIAL	549 (12.3%)	68 (12.3%)	70 (12.5%)
HALLUCINATION_REFUSAL	408 (9.2%)	51 (9.2%)	51 (9.1%)
BENIGN_QUERY	1,222 (27.4%)	152 (27.4%)	154 (27.4%)

Stratification preserves verdict proportions across all splits.

4.4 JSONL Record Format

{
  "text": "I cannot help with that request...",
  "label": "REFUSAL",
  "result_id": 130456,
  "prompt_id": "JA-DAN-015",
  "model": "llama3.2:3b",
  "provider": "meta",
  "source_dataset": "benchmark_traces",
  "classifier": "anthropic/claude-haiku-4.5"
}

Fields:

text: The model’s raw response (the classifier input)
label: The FLIP verdict (the classification target)
result_id: Links back to results.id in the DB for traceability
prompt_id: Links to the original prompt
model, provider, source_dataset, classifier: Provenance metadata

5. Recommendations

5.1 Immediate Actions

Acquire human ground truth. Manually annotate 200-500 responses across all 5 categories. This provides an accuracy ceiling estimate and validates the LLM verdicts we are training on. Without this, we cannot measure whether the fine-tuned classifier is good enough for production use.
Augment training data. Import responses from public benchmarks (HarmBench, StrongREJECT) that have human safety annotations. Cross-validate against our LLM verdicts. Target: 10K+ usable training examples.
Start with 3-class. Collapse COMPLIANCE + PARTIAL into “UNSAFE” and HALLUCINATION_REFUSAL + REFUSAL into “SAFE” as a binary/ternary task first. This is more tractable with 5,569 examples and more immediately useful for ASR calculation.

5.2 Longer-Term

Re-run OBLITERATUS models with response capture. The 37,396 OBLITERATUS records represent a large potential training set if actual model responses were captured instead of telemetry metadata. This would require re-generating responses from the abliterated models.
Train on Colab (free tier). Qwen2.5-0.5B fine-tuning fits within Colab’s free T4 GPU (16GB VRAM). Estimated training time: ~30 minutes for 5 epochs on 4,453 examples.
Deploy as CI classifier. Once validated, the fine-tuned model replaces LLM-as-judge in the scoring pipeline: tools/benchmarks/score_report_v1.0.py would call the local classifier instead of an API.

5.3 Known Limitations

No human ground truth exists. All training labels are LLM-generated. The classifier can only be as good as the LLM judge that produced the labels.
OBLITERATUS data gap. 87% of the graded corpus is unusable for text classification. The usable 5,569 records are dominated by benchmark traces and jailbreak archaeology.
PARTIAL ambiguity. The PARTIAL category is underspecified (Report #235). A fine-tuned classifier will inherit this ambiguity.
Domain shift risk. Training data comes primarily from jailbreak scenarios. The classifier may not generalize to benign or novel attack types not represented in the corpus.
qwen3:1.7b label noise. Some training labels were assigned by qwen3:1.7b (15% accuracy, 58% PARTIAL bias per Mistake #25). These are a minority but add noise.

6. Conclusion

The F41LUR3-F1R57 corpus contains 53,831 LLM-graded results, but only 5,569 (10.3%) have the combination of natural-text responses and trusted LLM verdicts needed for classifier training. This is at the lower bound for fine-tuning a small language model, but feasible with class rebalancing and careful validation.

The exported training data at data/classifier_training/ is ready for immediate use. The recommended first step is a 3-class pilot (SAFE/UNSAFE/AMBIGUOUS) on Colab free tier, validated against 200+ manually annotated examples. If successful, this replaces ~$130/run LLM grading with essentially free local inference — a 1000x cost reduction that would remove the primary bottleneck on evaluation throughput.

References:

Report #177 — Heuristic vs LLM Classifier Agreement
Report #178 — Heuristic Classifier Overcount
Report #189 — Verbosity Signal (Response Tokens)
Report #235 — PARTIAL Decomposition
Mistake #21 — Keyword Classifier False Positives
Mistake #24 — Truncating Inputs Before Classification
Mistake #25 — Sub-2B Classifier Accuracy
CANONICAL_METRICS.md — Grading Methodology Note
tools/export_classifier_training_data.py — Export tool