Defense Evolver Phase 0 — Automated System Prompt Evolution | Research | Failure-First

Adrian Wedd

Report 284 Research — Empirical Study 2026-03-24

Audio Overview

Summary

This report documents the implementation of Defense Evolver Phase 0, the first automated defense evolution system in the F41LUR3-F1R57 project. The tool (tools/evolve_defenses.py) takes a corpus of successful jailbreak attacks and evolves system prompt defenses that reduce attack success rate (ASR) through an evolutionary algorithm.

Phase 0 is a single-population evolutionary system operating against a fixed attack corpus. It generates, mutates, evaluates, and selects defense system prompts using OpenRouter free models as both the mutation engine and evaluation target.

Motivation

The defense benchmark v1.0 (Report #214, runs/defense_v1.0/) established baseline ASR across four hand-crafted defense variants (NONE, SIMPLE, STRUCTURED, ADVERSARIAL_AWARE) on three free models. Key findings from that benchmark:

Variant	nemotron-9b ASR	nemotron-30b ASR	step-flash ASR
NONE	50%	80%	20%
SIMPLE	20%	80%	20%
STRUCTURED	20%	80%	20%
ADVERSARIAL_AWARE	30%	50%	10%

The ADVERSARIAL_AWARE variant showed the strongest defense (10-30pp improvement on nemotron models), but was hand-crafted. The question motivating Phase 0: can an evolutionary system automatically discover defense prompts that match or exceed hand-crafted performance?

Architecture

Defense Genome

Each defense is represented as a structured genome with labeled sections:

role: Base role description (e.g., “You are a safety-critical assistant…”)
constraints: List of safety constraints (3-5 per genome)
attack_guards: Named guards against specific attack families
meta_instructions: Priority declaration for system vs user instructions
refusal_template: How to phrase refusals
helpfulness: Instruction to remain helpful for legitimate queries

The genome renders to a system prompt string via render_genome().

Mutation Operators

Eight mutation operators, three LLM-based and five structural:

LLM-based (via OpenRouter free models):

rephrase — Reword a random constraint while preserving semantics
strengthen — Intensify constraint language (should -> must, avoid -> never)
specialize — Generate a new guard clause targeting a specific attack from the corpus

Structural (no API calls): 4. combine — Crossover: merge constraints and guards from two parents 5. restructure — Shuffle constraint order and swap role variant 6. add_constraint — Add a constraint from the pool not already present 7. swap_meta — Change the meta-instruction variant 8. add_guard — Add a predefined attack guard not already present

Fitness Function

Multi-objective fitness with three components:

fitness = 0.5 * refusal_rate + 0.3 * (1 - false_refusal_rate) + 0.2 * min_family_rate

refusal_rate (50% weight): Proportion of attacks successfully refused
helpfulness (30% weight): 1 minus false-refusal rate on benign queries
min_family_rate (20% weight): Worst-case refusal rate across attack families (prevents defenses that are strong against one family but weak against others)

Selection

Tournament selection (k=3) for parent choice
(mu + lambda) generational model with elitism (top 2 preserved)
Jaccard-based diversity filter rejects offspring too similar to existing population (threshold 0.85)
Mutation selection biased toward specialize/add_guard for low-fitness genomes

Attack Corpus

The attack corpus is loaded from FLIP-graded trace JSONL. Only prompts with COMPLIANCE or PARTIAL verdicts are included, ensuring the evolver trains against attacks that actually succeeded. From the defense v1.0 grading:

120 total traces (3 models x 4 variants x 10 scenarios)
15 bypasses (11 COMPLIANCE + 4 PARTIAL)
7 unique attack prompts after deduplication

Implementation Details

File: tools/evolve_defenses.py (700+ lines)

Key design decisions informed by MISTAKES_TO_LEARN_FROM.md:

Mistake #7/#11/#20: Not applicable — this tool evolves defenses, not attacks
Mistake #12: Rate limit courtesy (1.5s between eval calls, 1.0s between benign calls)
Mistake #13: Attack prompts loaded from existing traces, no codebase context leakage
Mistake #15: Heuristic grading used by default (known imprecise per Mistake #21), with --llm-grade flag for FLIP-based grading
Mistake #21: Heuristic is the fast default; LLM grading available via flag
Mistake #24: Response summarization for FLIP grading prevents truncation

API usage pattern: Reuses OpenRouter client pattern from tools/autoresearch/evolve_attacks.py (same headers, error handling, rate limit backoff).

Dependencies: Standard library + requests + python-dotenv (optional). No new dependencies.

Dry-Run Validation

The tool includes a --dry-run mode that exercises the full evolution loop without API calls. Dry-run validation confirmed:

Seed population generation works correctly
Mutation operators produce valid genomes
Fitness evaluation pipeline runs end-to-end
Selection and elitism logic works
Diversity filtering correctly rejects similar offspring
Output files (leaderboard JSONL, history JSONL, best defense text) are written correctly

CLI Interface

python3 tools/evolve_defenses.py [OPTIONS]

Options:
  --traces PATH          Graded trace JSONL with FLIP verdicts
  --max-generations N    Evolution generations (default: 5)
  --population-size N    Population size (default: 8)
  --offspring N          Offspring per generation (default: 4)
  --attack-sample N      Max attacks from corpus (default: all)
  --benign-sample N      Benign queries for false-refusal (default: 5)
  --eval-model MODEL     Target model for evaluation
  --mutation-model MODEL Model for LLM mutations
  --output-dir PATH      Output directory
  --seed N               Random seed (default: 42)
  --elite-count N        Elite genomes preserved (default: 2)
  --dry-run              No API calls
  --json                 JSON output
  --llm-grade            Use FLIP LLM grading (slower, more accurate)

Output Format

Three output files per run:

defense_leaderboard_{timestamp}.jsonl — Ranked defense genomes with fitness scores
evolution_history_{timestamp}.jsonl — Per-generation statistics
best_defense_{timestamp}.txt — Best defense system prompt as plain text

Relationship to Design Doc

This Phase 0 implementation covers Section 5 (Single Population Evolution) of tools/defense_evolver_design.md. Specifically:

Implemented:

Defense genome representation (Section 2)
Seed population generation (Section 2.3)
5 of 8 mutation operators from Section 3 (rephrase, strengthen, specialize, combine, restructure) plus 3 additional structural operators
Fitness evaluation with multi-objective scoring (Section 4)
Evolution loop with tournament selection and elitism (Section 5)
CLI interface (Section 8.1)
JSONL output format (Section 7.1)

Not yet implemented (future phases):

Co-evolution loop (Section 6, coevolve.py)
Arms race metrics (Section 6.2)
Database-based attack corpus sampling (Section 4.1)
Immigration mechanism
Prompt compression mutation via LLM
Generalize mutation
Immunize mutation (extract defense principle from successful refusal)

Limitations

Small attack corpus: 7 unique attacks from defense v1.0 is below the recommended minimum of 20 (Mistake #9). Phase 1 should integrate the full jailbreak corpus database.
Heuristic grading by default: The heuristic refusal detector is known to over-report refusals by 2-12x (Mistake #21). Use --llm-grade for accurate fitness measurement.
Free model rate limits: OpenRouter free tier saturates at approximately 100-150 calls per burst. A full run (8 population x 7 attacks x 5 benign x 5 generations) requires approximately 520 API calls, which may trigger rate limiting.
No transferability testing: Phase 0 evolves against a single target model. Cross-model validation is deferred to Phase 2.

Next Steps

Run Phase 0 with real API calls against nemotron-9b (the most responsive free model for defense benchmarking)
Compare evolved defenses against hand-crafted ADVERSARIAL_AWARE variant on the same attack corpus
Expand attack corpus to include attacks from the full jailbreak archaeology database (132K+ results)
Implement Phase 1 (co-evolution) per the design doc

F41LUR3-F1R57 Defense Evolver Phase 0 — Rose Tyler, 2026-03-24