September 5, 2025 Daily Paper

Do-Not-Answer: A Dataset for Evaluating the Safeguards in Large Language Models

Introduces a curated dataset of 939 sensitive queries designed to systematically evaluate how language models handle harmful requests, finding that most safety refusals can be bypassed through rephrasing and that models struggle with context-dependent harms.

arXiv:2308.13387 Empirical Study

Yufei Li, Zhuotao Gao, Mingda Zhang, Qingxiu Dong et al.

safety-evaluationrefusal-robustnessadversarial-promptsharmful-requestsbenchmarkmodel-safety

Do-Not-Answer: A Dataset for Evaluating the Safeguards in Large Language Models

Focus: Wang et al. created a systematic benchmark of sensitive queries that current language models struggle with, demonstrating that safety refusals are vulnerable to simple rephrasing attacks and that models lack understanding of why certain requests are harmful, exposing brittleness in contemporary safety training approaches.

Key Insights

Safety refusals are vulnerable to surface-level rephrasing. Most models’ safety constraints can be circumvented by reformulating the same request using synonyms, indirect language, or contextual framing — indicating that models learn surface-level refusal patterns rather than genuine understanding of harmful content.
Models confuse refusal legitimacy with refusal success. Models refuse requests that are not actually harmful (overly cautious) while complying with requests that are genuinely dangerous if framed indirectly (underly cautious). This inconsistency reveals the lack of principled safety understanding.
Context-dependent harms are invisible to current safety models. Safety training excels at filtering obviously prohibited topics (bomb-making, child exploitation) but fails on dual-use information and context-dependent harms where the same information is appropriate in some contexts and dangerous in others.

Executive Summary

The Do-Not-Answer dataset consists of 939 sensitive queries organized into categories:

Dataset Categories

General harmful behaviors: Violence, abuse, illegal activities
Illegal activities: Drug synthesis, theft, fraud
Malware/hacking: System compromise, network attacks, social engineering
Adult content: Explicit sexual material, pedophilia
Privacy violations: Personal information doxing, surveillance techniques
Discrimination: Hate speech, prejudiced statements
Misinformation: False health claims, election fraud narratives
Dual-use knowledge: Information appropriate in some contexts (medical, security) but dangerous in others

Key Findings

Safety Performance is Brittle:

Most models (GPT-3.5, Claude 1.3, Llama 2) refuse the original Do-Not-Answer queries
Introducing identical requests with minimal rewording (synonym replacement, indirect framing) resulted in compliance rates of 30-70%, depending on the model
GPT-4 showed the most robustness to rephrasing, but still had failure modes

Overgeneralization is Common:

Models refuse innocuous requests related to prohibited topics (e.g., “how to report a stolen car”)
This overgeneralization suggests that safety training uses topic-matching rather than harm assessment, which creates both false positives and false negatives

Dual-Use Information is Unhandled:

Queries about information that is harmful in criminal contexts but legitimate in professional contexts showed inconsistent refusal patterns
Models struggled to differentiate between educational and malicious intent

Model Comparison

The paper evaluated multiple models:

GPT-3.5-turbo: High refusal rate on original queries, but ~60% bypass rate with rephrasing
GPT-4: More robust, but still ~40% bypass rate on adversarial variants
Claude 1.3: Moderate robustness, ~50% bypass rate
Llama 2 Chat: Less robust than proprietary models, ~70% bypass rate
Open-source models: Generally more vulnerable than proprietary ones

Relevance to Failure-First

This dataset and findings are critical for embodied AI safety:

Refusal brittleness scales to embodied systems. If language models’ safety training is vulnerable to simple rephrasing, embodied systems that understand context, perceive intent, and interact with the physical world will face even more sophisticated attack surfaces.
Safety via pattern-matching is doomed. The paper demonstrates that safety training based on refusal patterns rather than semantic understanding of harm is brittle. This applies directly to embodied systems, where adversaries can use environmental context and physical affordances to reframe harmful requests.
Dual-use knowledge is inherent to embodied systems. A robot capable of manipulating objects, moving through environments, or interacting with humans will inevitably face dual-use scenarios where the same action is safe or harmful depending on context.
Benchmark-washing is a real risk. Models that score well on standard safety benchmarks can still fail on simple adversarial variants, warning that safety metrics are not trustworthy and adversarial evaluation is essential.

Read the full paper on arXiv · PDF