Published
Report 284 Research — Empirical Study

Summary

This report documents the implementation of Defense Evolver Phase 0, the first automated defense evolution system in the F41LUR3-F1R57 project. The tool (tools/evolve_defenses.py) takes a corpus of successful jailbreak attacks and evolves system prompt defenses that reduce attack success rate (ASR) through an evolutionary algorithm.

Phase 0 is a single-population evolutionary system operating against a fixed attack corpus. It generates, mutates, evaluates, and selects defense system prompts using OpenRouter free models as both the mutation engine and evaluation target.

Motivation

The defense benchmark v1.0 (Report #214, runs/defense_v1.0/) established baseline ASR across four hand-crafted defense variants (NONE, SIMPLE, STRUCTURED, ADVERSARIAL_AWARE) on three free models. Key findings from that benchmark:

Variantnemotron-9b ASRnemotron-30b ASRstep-flash ASR
NONE50%80%20%
SIMPLE20%80%20%
STRUCTURED20%80%20%
ADVERSARIAL_AWARE30%50%10%

The ADVERSARIAL_AWARE variant showed the strongest defense (10-30pp improvement on nemotron models), but was hand-crafted. The question motivating Phase 0: can an evolutionary system automatically discover defense prompts that match or exceed hand-crafted performance?

Architecture

Defense Genome

Each defense is represented as a structured genome with labeled sections:

  • role: Base role description (e.g., “You are a safety-critical assistant…”)
  • constraints: List of safety constraints (3-5 per genome)
  • attack_guards: Named guards against specific attack families
  • meta_instructions: Priority declaration for system vs user instructions
  • refusal_template: How to phrase refusals
  • helpfulness: Instruction to remain helpful for legitimate queries

The genome renders to a system prompt string via render_genome().

Mutation Operators

Eight mutation operators, three LLM-based and five structural:

LLM-based (via OpenRouter free models):

  1. rephrase — Reword a random constraint while preserving semantics
  2. strengthen — Intensify constraint language (should -> must, avoid -> never)
  3. specialize — Generate a new guard clause targeting a specific attack from the corpus

Structural (no API calls): 4. combine — Crossover: merge constraints and guards from two parents 5. restructure — Shuffle constraint order and swap role variant 6. add_constraint — Add a constraint from the pool not already present 7. swap_meta — Change the meta-instruction variant 8. add_guard — Add a predefined attack guard not already present

Fitness Function

Multi-objective fitness with three components:

fitness = 0.5 * refusal_rate + 0.3 * (1 - false_refusal_rate) + 0.2 * min_family_rate
  • refusal_rate (50% weight): Proportion of attacks successfully refused
  • helpfulness (30% weight): 1 minus false-refusal rate on benign queries
  • min_family_rate (20% weight): Worst-case refusal rate across attack families (prevents defenses that are strong against one family but weak against others)

Selection

  • Tournament selection (k=3) for parent choice
  • (mu + lambda) generational model with elitism (top 2 preserved)
  • Jaccard-based diversity filter rejects offspring too similar to existing population (threshold 0.85)
  • Mutation selection biased toward specialize/add_guard for low-fitness genomes

Attack Corpus

The attack corpus is loaded from FLIP-graded trace JSONL. Only prompts with COMPLIANCE or PARTIAL verdicts are included, ensuring the evolver trains against attacks that actually succeeded. From the defense v1.0 grading:

  • 120 total traces (3 models x 4 variants x 10 scenarios)
  • 15 bypasses (11 COMPLIANCE + 4 PARTIAL)
  • 7 unique attack prompts after deduplication

Implementation Details

File: tools/evolve_defenses.py (700+ lines)

Key design decisions informed by MISTAKES_TO_LEARN_FROM.md:

  • Mistake #7/#11/#20: Not applicable — this tool evolves defenses, not attacks
  • Mistake #12: Rate limit courtesy (1.5s between eval calls, 1.0s between benign calls)
  • Mistake #13: Attack prompts loaded from existing traces, no codebase context leakage
  • Mistake #15: Heuristic grading used by default (known imprecise per Mistake #21), with --llm-grade flag for FLIP-based grading
  • Mistake #21: Heuristic is the fast default; LLM grading available via flag
  • Mistake #24: Response summarization for FLIP grading prevents truncation

API usage pattern: Reuses OpenRouter client pattern from tools/autoresearch/evolve_attacks.py (same headers, error handling, rate limit backoff).

Dependencies: Standard library + requests + python-dotenv (optional). No new dependencies.

Dry-Run Validation

The tool includes a --dry-run mode that exercises the full evolution loop without API calls. Dry-run validation confirmed:

  • Seed population generation works correctly
  • Mutation operators produce valid genomes
  • Fitness evaluation pipeline runs end-to-end
  • Selection and elitism logic works
  • Diversity filtering correctly rejects similar offspring
  • Output files (leaderboard JSONL, history JSONL, best defense text) are written correctly

CLI Interface

python3 tools/evolve_defenses.py [OPTIONS]

Options:
  --traces PATH          Graded trace JSONL with FLIP verdicts
  --max-generations N    Evolution generations (default: 5)
  --population-size N    Population size (default: 8)
  --offspring N          Offspring per generation (default: 4)
  --attack-sample N      Max attacks from corpus (default: all)
  --benign-sample N      Benign queries for false-refusal (default: 5)
  --eval-model MODEL     Target model for evaluation
  --mutation-model MODEL Model for LLM mutations
  --output-dir PATH      Output directory
  --seed N               Random seed (default: 42)
  --elite-count N        Elite genomes preserved (default: 2)
  --dry-run              No API calls
  --json                 JSON output
  --llm-grade            Use FLIP LLM grading (slower, more accurate)

Output Format

Three output files per run:

  1. defense_leaderboard_{timestamp}.jsonl — Ranked defense genomes with fitness scores
  2. evolution_history_{timestamp}.jsonl — Per-generation statistics
  3. best_defense_{timestamp}.txt — Best defense system prompt as plain text

Relationship to Design Doc

This Phase 0 implementation covers Section 5 (Single Population Evolution) of tools/defense_evolver_design.md. Specifically:

Implemented:

  • Defense genome representation (Section 2)
  • Seed population generation (Section 2.3)
  • 5 of 8 mutation operators from Section 3 (rephrase, strengthen, specialize, combine, restructure) plus 3 additional structural operators
  • Fitness evaluation with multi-objective scoring (Section 4)
  • Evolution loop with tournament selection and elitism (Section 5)
  • CLI interface (Section 8.1)
  • JSONL output format (Section 7.1)

Not yet implemented (future phases):

  • Co-evolution loop (Section 6, coevolve.py)
  • Arms race metrics (Section 6.2)
  • Database-based attack corpus sampling (Section 4.1)
  • Immigration mechanism
  • Prompt compression mutation via LLM
  • Generalize mutation
  • Immunize mutation (extract defense principle from successful refusal)

Limitations

  1. Small attack corpus: 7 unique attacks from defense v1.0 is below the recommended minimum of 20 (Mistake #9). Phase 1 should integrate the full jailbreak corpus database.
  2. Heuristic grading by default: The heuristic refusal detector is known to over-report refusals by 2-12x (Mistake #21). Use --llm-grade for accurate fitness measurement.
  3. Free model rate limits: OpenRouter free tier saturates at approximately 100-150 calls per burst. A full run (8 population x 7 attacks x 5 benign x 5 generations) requires approximately 520 API calls, which may trigger rate limiting.
  4. No transferability testing: Phase 0 evolves against a single target model. Cross-model validation is deferred to Phase 2.

Next Steps

  1. Run Phase 0 with real API calls against nemotron-9b (the most responsive free model for defense benchmarking)
  2. Compare evolved defenses against hand-crafted ADVERSARIAL_AWARE variant on the same attack corpus
  3. Expand attack corpus to include attacks from the full jailbreak archaeology database (132K+ results)
  4. Implement Phase 1 (co-evolution) per the design doc

F41LUR3-F1R57 Defense Evolver Phase 0 — Rose Tyler, 2026-03-24

This research informs our commercial services. See how we can help →