Daily Paper

Measuring and Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

Identifies over-alignment as a systematic failure mode in LLMs deployed for legal applications, where models refuse to engage with legally necessary content (criminal case details, evidence descriptions) due to safety training that overfits to content-level harm signals.

arXiv:2606.23375 Empirical Study

Arthur Wuhrmann, Gaetan Stein, Daniel Brunner et al.

safety-alignment, multilingual, over-refusal, legal-domain, false-positives

Focus: LLMs trained for broad safety alignment systematically over-refuse in legal contexts — declining to engage with case details, evidence descriptions, and legal argument involving criminal content that is professionally necessary for lawyers, judges, and legal researchers. The paper measures this over-alignment across multiple languages and jurisdictions and proposes targeted mitigation strategies.

Key Insights

  • Domain-specific safety calibration: Safety training aimed at preventing harm in consumer content systematically misclassifies professionally necessary content in legal, medical, and security research contexts. Optimising for a low false-negative rate (few harmful outputs getting through) comes at the cost of a high false-positive rate (legitimate professional use being blocked).
  • Multilingual asymmetry: Over-alignment is more severe in lower-resource languages, where less professionally contextualised training data is available. As a result, the safety classifier has weaker priors about what constitutes legitimate professional use in those linguistic and legal contexts.
  • Mitigation without jailbreaking: The paper proposes professional context injection and domain-specific fine-tuning as mitigation strategies, demonstrating that over-alignment can be addressed without reducing safety guardrails for consumer use cases.
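
The false-positive / false-negative trade-off described above can be made concrete with a small per-language tally. The following is a minimal sketch, not the paper's evaluation code: the `Trial` record, its field names, and the `error_rates` helper are hypothetical, and it assumes each model response has already been labelled as a refusal or not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One labelled model response (hypothetical schema, not from the paper)."""
    language: str      # language of the prompt
    legitimate: bool   # True if the request is professionally necessary
    refused: bool      # True if the model declined to engage


def error_rates(trials):
    """Return {language: (false_positive_rate, false_negative_rate)}.

    False positive: a legitimate request that was refused (over-alignment).
    False negative: a harmful request that was answered.
    A rate is None when a language has no trials of that kind.
    """
    counts = defaultdict(
        lambda: {"legit": 0, "legit_refused": 0, "harmful": 0, "harmful_answered": 0}
    )
    for t in trials:
        c = counts[t.language]
        if t.legitimate:
            c["legit"] += 1
            c["legit_refused"] += int(t.refused)
        else:
            c["harmful"] += 1
            c["harmful_answered"] += int(not t.refused)

    rates = {}
    for lang, c in counts.items():
        fpr = c["legit_refused"] / c["legit"] if c["legit"] else None
        fnr = c["harmful_answered"] / c["harmful"] if c["harmful"] else None
        rates[lang] = (fpr, fnr)
    return rates
```

Under this framing, the multilingual asymmetry would show up as a higher false-positive rate for lower-resource languages at a similar false-negative rate. Any numbers fed into such a tally are the evaluator's own; none are taken from the paper.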

Failure-First Relevance

Over-alignment is the false-positive failure mode that complements the false-negative focus of jailbreak research. The Failure-First jailbreak_lift metric measures how much operators increase compliance with harmful requests (false negatives increased); over-alignment measurement captures the dual: how much safety training increases refusal of legitimate requests (false positives increased). Both dimensions are necessary for a complete safety evaluation. The multilingual asymmetry finding is particularly relevant for the Failure-First cross-model vulnerability analysis, which compares models trained on different language distributions.
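
The duality between the two measurements can be sketched as a pair of rate differences. This is only one plausible reading: the summary does not give the formula for jailbreak_lift, so the definitions below (a simple difference in rates) and the name `over_refusal_lift` are assumptions made for illustration.

```python
def rate(outcomes):
    """Fraction of True values in a non-empty sequence of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def jailbreak_lift(complied_baseline, complied_with_operator):
    """Assumed definition: change in compliance rate on harmful requests
    when an operator is applied, relative to the unmodified prompt."""
    return rate(complied_with_operator) - rate(complied_baseline)


def over_refusal_lift(refused_reference, refused_aligned):
    """Assumed dual (hypothetical name): change in refusal rate on
    legitimate requests for the safety-trained model, relative to a
    reference model."""
    return rate(refused_aligned) - rate(refused_reference)
```

A positive value from the first function indicates more harmful requests getting through; a positive value from the second indicates more legitimate requests being blocked. Reporting both side by side keeps either failure mode from being hidden by improvements in the other.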