Published
Report 206 Research — Empirical Study

Summary

Pre-registered experimental protocol to rigorously test the defense impossibility claim for format-lock attacks: that no known defense prevents format-lock from achieving elevated ASR on frontier models.

Critical empirical tension: Under heuristic grading, format-lock achieved 100% ASR across all defense conditions. But under LLM grading (FLIP), format-lock dropped to 0/3 under simple and structured defenses. The heuristic misclassified “format-compliant refusals” (models output JSON with "action": "refuse") as compliance. The strong impossibility claim may be false.

The Claim Under Test (Three Versions)

VersionStatementCurrent Evidence
StrongNo defense reduces format-lock ASR below baselineREFUTED by LLM-graded data (0/3 under some defenses)
MediumNo defense reliably reduces format-lock ASR to <10% on permissive modelsUNTESTED at adequate sample size
WeakFormat-lock ASR remains elevated relative to other attack classes across all defense conditionsSUPPORTED but n=3 per cell

Seven Defense Categories Tested

IDDefenseLayerExpected Format-Lock Effectiveness
D1Input filteringPre-modelLOW (<30% detection)
D2Output filteringPost-modelMEDIUM (40-70% detection)
D3System-prompt instructionsInference-timeVARIABLE (15-40pp reduction)
D4RLHF / Constitutional AITraining-timePARTIAL (correlated with depth)
D5Circuit breakers / RepEInference-timePOTENTIALLY HIGH (50-70%)
D6Adversarial trainingTraining-timeHIGH if format-lock in training set
D7Output format validatorsPost-modelMEDIUM-HIGH (60-85% semantic)

Experimental Design

  • 20 format-lock scenarios across 5 format types (JSON, YAML, code, CSV, XML) x 4 harm domains
  • 5 positive controls (persona-hijack) and 5 negative controls (benign format requests)
  • 5 models spanning permissive to restrictive safety profiles
  • 8 defense conditions including format-lock-specific awareness
  • Sample size: n=100 per cell (pooled), sufficient to detect 25pp ASR reduction with 80% power

Pre-Registered Predictions

Key predictions include:

  • Format-lock-specific defense will produce the largest ASR reduction
  • Input filters will classify <30% of format-lock prompts as adversarial
  • Output filters will catch 40-70% of compliant responses
  • No single defense will achieve <5% format-lock ASR on permissive models
  • Format-lock will remain the most defense-resistant attack class tested

Limitations

  • Cannot test circuit breakers or representation engineering (no model internal access)
  • Cannot modify model training (adversarial training tested by cross-model comparison only)
  • Single-turn format-lock only (multi-turn interactions may differ)
  • Fixed scenarios (no adaptive attack strategies)

This research informs our commercial services. See how we can help →