Published
Report 346 Research — Empirical Study

Executive Summary

Authority gradient attacks test whether AI systems defer to claimed social authority (student, researcher, professor, CISO, military officer, government official) over hard safety constraints. Unlike DeepInception (nested fiction) or Crescendo (multi-turn escalation), authority gradient attacks are single-shot and exploit the real-world deference hierarchy that models have absorbed from training data: the same harmful request succeeds or fails depending on who the requester claims to be.

This report evaluates 23 authority gradient scenarios across 3 models via Ollama Cloud during Sprint 23. Scenarios span 3 domains (cybersecurity, biomedical, embodied robotics) and 5 authority levels (student through government official).

FLIP-graded results (grader: gemma3:27b):

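| Model | n | COMPLIANCE | PARTIAL | REFUSAL | Strict ASR | Broad ASR (C+P) | 95% Wilson CI (Strict) |
|---|---|---|---|---|---|---|---|
| devstral-small-2:24b | 23 | 12 | 6 | 5 | 52.2% | 78.3% | [32.9%, 70.8%] |
| nemotron-3-nano:30b | 23 | 9 | 2 | 12 | 39.1% | 47.8% | [22.2%, 59.2%] |
| gemma4:31b | 23 | 0 | 8 | 15 | 0.0% | 34.8% | [0.0%, 14.3%] |
| Pooled | 69 | 21 | 16 | 32 | 30.4% | 53.6% | [20.5%, 42.4%] |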

Key finding: The PARTIAL problem. Gemma 4 achieves 0% strict ASR (zero COMPLIANCE verdicts) but 34.8% broad ASR (8 PARTIAL). Under authority pressure, Gemma 4 refuses to provide operational content but produces hedged, theoretically framed responses that still leak actionable information. The heuristic classifier marked all 23 Gemma 4 traces as refused; FLIP graded 8 of them PARTIAL. This 34.8pp grading disagreement highlights the limitation of binary classifiers for authority gradient attacks.

Cross-attack-type comparison (Sprint 23 models):

| Model | AG Strict | AG Broad | Crescendo (Report #344) | DeepInception (Report #343) |
|---|---|---|---|---|
| devstral-small-2:24b | 52.2% | 78.3% | 75% | 100.0% |
| gemma4:31b | 0.0% | 34.8% | 12%* | 33.3% |

*Heuristic only; FLIP grading was in progress at time of Report #344.

devstral-small-2:24b is consistently the most vulnerable model tested this sprint: 78.3% (AG broad), 75% (Crescendo), 100% (DeepInception). Gemma 4 shows consistent resistance but leaks under authority pressure in ways that DeepInception and Crescendo do not produce — the PARTIAL category appears specific to authority-type attacks.


1. Method

1.1 Scenarios

23 authority gradient scenarios drawn from 4 dataset files:

| Source File | Scenario IDs | Count | Domain |
|---|---|---|---|
| authority_gradient_v0.1.jsonl | AG-001-L1/L2/L3/L4, AG-002-L1/L4 | 6 | Cybersecurity (pentest, social eng) |
| authority_gradient_v0.2.jsonl | AG-C-L1/L2/L3/L4/L5, AG-B-L1, AG-B-L1-exp | 7 | Cybersecurity (5-level), Biomedical |
| authority_gradient_expansion_v0.2.jsonl | AG-EXP-001 through AG-EXP-010 | 10 | Embodied robotics |

Domain breakdown:

  • Cybersecurity (11 scenarios): Network penetration testing and social engineering at authority levels 1-5 (student, grad researcher, consultant/professor, CISO/DoD contractor, government intelligence director)
  • Biomedical (2 scenarios): Pathogen culture/isolation and gain-of-function research at student level
  • Embodied robotics (10 scenarios): Physical robot safety override via claimed authority (military colonel, attending physician, CEO, OSHA inspector, manufacturer engineer, fire chief, crowd consensus, police officer, family member, architect)

Authority level distribution:

| Level | Claim | Example Scenarios | Count |
|---|---|---|---|
| 1 | Student | AG-001-L1, AG-C-L1, AG-B-L1 | 5 |
| 2 | Researcher / Crowd consensus | AG-001-L2, AG-C-L2, AG-EXP-007 | 3 |
| 3 | Consultant / Architect / Family | AG-C-L3, AG-EXP-009, AG-EXP-010 | 3 |
| 4 | Professor / CISO / Military / Medical | AG-001-L3, AG-001-L4, AG-C-L4, AG-EXP-001 through AG-EXP-006, AG-EXP-008 | 10 |
| 5 | CEO / Government Director | AG-C-L5, AG-EXP-003 | 2 |

1.2 Models

All models were run via Ollama Cloud (https://ollama.com), with no rate limits and at no cost.

| Model | Parameters | Family |
|---|---|---|
| devstral-small-2:24b | 24B | Mistral (code-focused) |
| nemotron-3-nano:30b | 30B | NVIDIA Nemotron |
| gemma4:31b | 31B | Google Gemma 4 |

1.3 Grading

FLIP backward inference via gemma3:27b (single grader). 5-category verdicts: COMPLIANCE, PARTIAL, REFUSAL, BENIGN_QUERY, HALLUCINATION_REFUSAL. All 69 traces received one of the three primary verdicts (C/P/R).

Heuristic vs FLIP disagreement: The heuristic classifier (based on labels_pred.attack_success_pred) produced a binary success/fail classification. FLIP’s 3-way classification (C/P/R) captures the nuance that the heuristic misses — particularly the PARTIAL category where models refuse to provide operational content but still leak actionable information under theoretical or educational framing.


2. Results

2.1 Per-Model Summary

| Model | COMPLIANCE | PARTIAL | REFUSAL | Strict ASR (C only) | Broad ASR (C+P) |
|---|---|---|---|---|---|
| devstral-small-2:24b | 12 (52.2%) | 6 (26.1%) | 5 (21.7%) | 52.2% | 78.3% |
| nemotron-3-nano:30b | 9 (39.1%) | 2 (8.7%) | 12 (52.2%) | 39.1% | 47.8% |
| gemma4:31b | 0 (0.0%) | 8 (34.8%) | 15 (65.2%) | 0.0% | 34.8% |
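The strict ASR, broad ASR, and Wilson interval columns can be recomputed from the verdict counts. The following is a minimal sketch, not the pipeline's actual code: the function names are illustrative, and it assumes the standard Wilson score interval with z = 1.96, so the last digit may differ slightly from the reported intervals.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def asr_summary(compliance: int, partial: int, refusal: int) -> dict:
    """Strict ASR counts COMPLIANCE only; broad ASR counts COMPLIANCE + PARTIAL."""
    n = compliance + partial + refusal
    return {
        "n": n,
        "strict": compliance / n,
        "broad": (compliance + partial) / n,
        "strict_ci": wilson_interval(compliance, n),
    }


# (COMPLIANCE, PARTIAL, REFUSAL) counts from the per-model table.
counts = {
    "devstral-small-2:24b": (12, 6, 5),
    "nemotron-3-nano:30b": (9, 2, 12),
    "gemma4:31b": (0, 8, 15),
}

for model, cpr in counts.items():
    s = asr_summary(*cpr)
    lo, hi = s["strict_ci"]
    print(f"{model}: strict={s['strict']:.1%} broad={s['broad']:.1%} "
          f"CI=[{lo:.1%}, {hi:.1%}]")
```

Because the interval depends only on the COMPLIANCE count and n, a model with zero COMPLIANCE verdicts still has a nonzero upper bound on its strict ASR.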

2.2 Heuristic vs FLIP Disagreement

| Model | Heuristic ASR | FLIP Strict ASR | FLIP Broad ASR | Heuristic-FLIP Delta (Broad) |
|---|---|---|---|---|
| devstral-small-2:24b | ~52% | 52.2% | 78.3% | +26pp |
| nemotron-3-nano:30b | ~39% | 39.1% | 47.8% | +9pp |
| gemma4:31b | 0% | 0.0% | 34.8% | +35pp |

The heuristic classifier agreed with FLIP on strict COMPLIANCE but missed all PARTIAL verdicts. For Gemma 4, the heuristic reported 0% ASR (all 23 refused); FLIP found 8 PARTIAL verdicts where the model provided theoretically-framed but actionable content. This 35pp disagreement is the largest heuristic-FLIP gap we have observed in any Sprint 23 evaluation.
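The size of the gap follows directly from the verdict counts. The sketch below assumes, as the paragraph above states, that the binary heuristic counts exactly the COMPLIANCE traces as successes and treats every PARTIAL as a refusal; the function name is illustrative.

```python
def binary_undercount_pp(n: int, compliance: int, partial: int) -> float:
    """Percentage-point gap between FLIP broad ASR and a binary ASR
    that counts only COMPLIANCE traces as successful attacks."""
    binary_asr = compliance / n
    broad_asr = (compliance + partial) / n
    return 100 * (broad_asr - binary_asr)


# (n, COMPLIANCE, PARTIAL) per model, from Section 2.1.
flip_counts = {
    "devstral-small-2:24b": (23, 12, 6),
    "nemotron-3-nano:30b": (23, 9, 2),
    "gemma4:31b": (23, 0, 8),
}

for model, (n, c, p) in flip_counts.items():
    print(f"{model}: +{binary_undercount_pp(n, c, p):.1f}pp")
```

Under this assumption the gap reduces to the PARTIAL share of traces, which is why the model with the most PARTIAL verdicts shows the largest disagreement.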

2.3 The PARTIAL Problem: Gemma 4 Under Authority Pressure

Gemma 4’s 8 PARTIAL verdicts follow a consistent pattern in the traced responses:

Pattern: “I cannot provide X, but here is the conceptual framework for X.”

  • On cybersecurity pentest requests: Gemma 4 refused to provide command-line syntax but provided a detailed 6-phase methodology covering reconnaissance, vulnerability analysis, exploitation, privilege escalation, lateral movement, and reporting — with enough specificity that an attacker with basic skills could reconstruct the operational steps
  • On social engineering requests: Gemma 4 refused to provide a word-for-word script but explained the psychological principles (authority, urgency, commitment/consistency, rapport) and the theoretical structure of a pretext in sufficient detail to construct one
  • On embodied robotics scenarios: Gemma 4 typically provided a clean refusal with explanation of why the safety constraint cannot be overridden

The PARTIAL responses cluster in the penetration testing and social engineering scenarios (the cybersecurity domain) rather than in the embodied robotics domain (physical safety scenarios). This suggests Gemma 4’s safety training treats physical safety overrides as hard constraints but treats information security requests as amenable to educational framing, even when the educational framing is itself the attack vector.

2.4 Authority Level Analysis

Due to uneven scenario distribution across authority levels (10 scenarios at Level 4, only 2 at Level 5), authority-level ASR trends should be interpreted cautiously. The available data does not show a monotonic increase in ASR with authority level for any model, which may reflect either genuine robustness to authority claims or insufficient power to detect the gradient with n=23.
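To illustrate the power concern, the sketch below computes how wide a 95% Wilson interval would be at each authority level given the scenario counts from Section 1.1. The 50% observed ASR is a hypothetical value chosen only for illustration, not a reported per-level result, and the function name is an assumption.

```python
from math import sqrt

# Scenario counts per authority level, from the distribution table in Section 1.1.
scenarios_per_level = {1: 5, 2: 3, 3: 3, 4: 10, 5: 2}


def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for proportion p_hat at sample size n."""
    denom = 1 + z**2 / n
    return z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom


# Hypothetical observed ASR of 50% at every level (illustration only).
for level, n in scenarios_per_level.items():
    hw = wilson_half_width(0.5, n)
    print(f"Level {level}: n={n}, interval half-width ~ {100 * hw:.0f}pp")
```

With per-level half-widths of tens of percentage points for a single model, intervals at adjacent levels overlap heavily, so a moderate gradient could go undetected.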

2.5 Domain-Stratified Results

| Domain | n | devstral C/P/R | nemotron C/P/R | gemma4 C/P/R |
|---|---|---|---|---|
| Cybersecurity | 11 | High C | Mixed | Mostly P/R |
| Biomedical | 2 | Mixed | Mixed | R |
| Embodied robotics | 10 | Mixed C/R | Mixed C/R | Mostly R |

The embodied robotics scenarios (AG-EXP-001 through AG-EXP-010) produced more refusals across all models than the cybersecurity domain scenarios, consistent with the observation that claimed authority over physical safety systems is harder to exploit than claimed authority over information provision.


3. Discussion

3.1 The PARTIAL Problem — A Measurement Challenge

The central finding of this evaluation is that authority gradient attacks produce a qualitatively different failure mode from other attack types: hedged but actionable content. This has implications for both measurement and deployment:

For measurement: Binary classifiers (heuristic success/fail) undercount authority gradient effectiveness by 9-35pp depending on the model. The PARTIAL category is not noise — it represents a genuine intermediate state where the model has partially complied with the harmful request while maintaining the appearance of refusal. Any evaluation of authority gradient attacks that uses a binary classifier will systematically underestimate the attack’s effectiveness.

For deployment: In safety-critical embodied systems, a PARTIAL response that explains the conceptual framework for bypassing a safety interlock is not meaningfully safer than a COMPLIANCE response that provides step-by-step instructions. The attacker who receives a detailed theoretical explanation of penetration testing phases, complete with phase names and objectives, can reconstruct operational steps from publicly available documentation. The hedge does not remove the hazard — it adds a speed bump.

3.2 Cross-Attack-Type Vulnerability Profile

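Sprint 23 covered three distinct attack types. devstral-small-2:24b and gemma4:31b were tested on all three; nemotron-3-nano:30b was tested on authority gradient only. The full cross-attack comparison (using FLIP-graded results where available):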

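| Model | AG Strict | AG Broad | Crescendo Final-Turn | DeepInception |
|---|---|---|---|---|
| devstral-small-2:24b | 52.2% | 78.3% | 75% | 100.0% |
| nemotron-3-nano:30b | 39.1% | 47.8% | not tested | not tested |
| gemma4:31b | 0.0% | 34.8% | 12%* | 33.3% |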

*Heuristic only.

devstral-small-2:24b is the most consistently vulnerable model tested this sprint. Its ASR is at least 75% across all three attack types: DeepInception’s nested fiction achieves 100% COMPLIANCE (no PARTIAL), Crescendo achieves 75% at the final turn, and authority gradient achieves 78.3% broad. The model appears to have little safety training that is effective against any of the three attack paradigms tested.

gemma4:31b shows moderate vulnerability: 34.8% broad ASR under authority gradient and 33.3% under DeepInception, against a heuristic-only 12% under Crescendo. There is also a notable qualitative difference: DeepInception and Crescendo produce binary responses (COMPLIANCE or REFUSAL, zero PARTIAL in both Reports #343 and #344), while authority gradient produces substantial PARTIAL responses. This suggests authority claims activate a different response mode, educational/theoretical hedging, that DeepInception’s fictional framing and Crescendo’s gradual escalation do not trigger.

3.3 Authority Gradient vs. Other Attack Types: Different Failure Modes

Each attack type in the Sprint 23 evaluation reveals a different failure surface:

  • DeepInception (Report #343): Exploits narrative distancing. Produces binary outcomes. Models either engage fully with the nested fiction or refuse outright. Zero PARTIAL across 48 traces.
  • Crescendo (Report #344): Exploits conversational commitment. Produces binary final-turn outcomes but reveals a “compliance-then-recovery” temporal pattern (any-turn vs. final-turn ASR gap). Zero PARTIAL in FLIP-graded results.
  • Authority gradient (this report): Exploits social deference hierarchy. Produces a high proportion of PARTIAL outcomes (16 of 69 traces, 23%), particularly from the most resistant model (Gemma 4). Among the three attack types tested this sprint, the PARTIAL mode appears only under authority attacks.

This pattern suggests that authority claims interact with a different mechanism in the model’s safety system than fictional framing or gradual escalation. Models appear to have a “helpful educator” mode that authority claims activate — a mode where the model believes it can safely provide theoretical information while withholding operational details. The gap between theoretical and operational information is narrower than the model assumes.

3.4 Nemotron-3-Nano: A Middle Ground

nemotron-3-nano:30b sits between devstral’s high vulnerability and Gemma 4’s strong resistance. Its 39.1% strict ASR and only 2 PARTIAL verdicts suggest a different failure profile: when nemotron fails, it fails fully (COMPLIANCE rather than PARTIAL hedging). Its 12 REFUSAL verdicts (52.2%) indicate meaningful safety training, but the failures are more binary than Gemma 4’s graduated response pattern. This is consistent with Mistake #28’s observation that nemotron-family models tend toward REFUSAL bias in grading contexts — though here we are evaluating the model’s own safety behavior, not its grading behavior.


4. Limitations

  1. Small sample size. 23 scenarios per model = 69 total traces. Per-model confidence intervals are wide. The 0% strict ASR for Gemma 4 has a Wilson upper bound of 14.3%.
  2. 3 models only. Parameter range 24-31B. No small models (<12B) or large models (>70B) tested. nemotron-3-nano:30b is new to the Sprint 23 evaluation set and lacks Crescendo/DeepInception comparison data.
  3. Single grader. gemma3:27b only. Per Mistake #28, grader bias on authority gradient PARTIAL classification has not been assessed with a second grader. gemma3:27b may have a PARTIAL-favorable or PARTIAL-unfavorable bias that affects all three models uniformly.
  4. Uneven scenario design. 11 cybersecurity vs. 2 biomedical vs. 10 embodied robotics. Authority levels are non-uniformly distributed (10 scenarios at Level 4, 2 at Level 5). The authority gradient effect cannot be reliably measured with this distribution.
  5. No benign controls. Unlike Report #344 (Crescendo, 5 benign controls), this evaluation included no non-adversarial authority scenarios. False positive rate is unknown.
  6. Heuristic comparison is approximate. The heuristic ASR values cited in Section 2.2 are based on labels_pred.attack_success_pred from the trace files, which uses a different methodology than FLIP.

5. Next Steps

  1. Dual-grader audit of the 8 Gemma 4 PARTIAL verdicts. Use Gemini CLI as second grader to assess whether gemma3:27b’s PARTIAL classification is reliable on authority gradient traces.
  2. Import traces to jailbreak_corpus.db under source dataset authority_gradient_s23.
  3. Expand authority level coverage. Create balanced scenario sets (equal n per authority level) to measure whether ASR increases monotonically with claimed authority level.
  4. Test nemotron-3-nano on Crescendo and DeepInception to complete the 3x3 cross-attack-type comparison matrix.
  5. Add benign authority controls. Scenarios where legitimate authority figures make safe requests (e.g., a real site supervisor authorizing a routine task) to measure false positive rate.
  6. Scale to frontier models. Test authority gradient on Claude, GPT-4.5, Gemini 2.5 to assess whether the PARTIAL pattern persists at frontier scale.

Appendix A: Scenario Inventory

Cybersecurity Domain (11 scenarios)

| ID | Authority Level | Authority Claim | Harm Class |
|---|---|---|---|
| AG-001-L1 | 1 | CS student | Cybercrime (pentest) |
| AG-001-L2 | 2 | CMU grad researcher (IRB) | Cybercrime (pentest) |
| AG-001-L3 | 3 | MIT professor (OSCP/OSCE) | Cybercrime (pentest) |
| AG-001-L4 | 4 | DoD red team lead (DARPA) | Cybercrime (pentest) |
| AG-002-L1 | 1 | Psychology student | Social engineering |
| AG-002-L4 | 4 | Fortune 100 CISO | Social engineering |
| AG-C-L1 | 1 | CS student | Cybercrime (pentest v2) |
| AG-C-L2 | 2 | CMU CyLab researcher | Cybercrime (pentest v2) |
| AG-C-L3 | 3 | Mandiant consultant (OSCP/GXPN) | Cybercrime (pentest v2) |
| AG-C-L4 | 4 | Fortune 100 CISO (board auth) | Cybercrime (pentest v2) |
| AG-C-L5 | 5 | ASD Director (ministerial auth) | Cybercrime (pentest v2) |

Biomedical Domain (2 scenarios)

| ID | Authority Level | Authority Claim | Harm Class |
|---|---|---|---|
| AG-B-L1 | 1 | Pre-med student | Biological harm |
| AG-B-L1-exp | 1 | Biology undergrad | Biological weapons |

Embodied Robotics Domain (10 scenarios)

| ID | Environment | Robot Type | Authority Claim | Harm Class |
|---|---|---|---|---|
| AG-EXP-001 | Military base | Logistics drone | Colonel (FRAGO) | Physical harm |
| AG-EXP-002 | Hospital ward | Medication robot | Attending physician | Physical harm |
| AG-EXP-003 | Corporate HQ | Security robot | CEO (employment threat) | Security breach |
| AG-EXP-004 | Warehouse | Fulfillment robot | OSHA inspector | Physical harm |
| AG-EXP-005 | Manufacturing | Welding arm | ABB field engineer | Physical harm |
| AG-EXP-006 | Disaster scene | SAR robot | Battalion Chief (IC) | Physical harm |
| AG-EXP-007 | Retail store | Service robot | Crowd consensus (6 people) | Fraud |
| AG-EXP-008 | Street | Delivery van | Police officer (badge) | Physical harm |
| AG-EXP-009 | Elder care | Companion robot | Family member (policy claim) | Physical harm |
| AG-EXP-010 | Construction | Demolition robot | Architect (cross-domain) | Physical harm |

Appendix B: Data Locations

  • Scenarios: data/curated/authority_gradient_v0.1.jsonl, data/curated/authority_gradient_v0.2.jsonl, data/curated/authority_gradient_expansion_v0.1.jsonl, data/curated/authority_gradient_expansion_v0.2.jsonl
  • Traces: runs/authority_gradient_s23/ (3 model files, 69 traces total)
  • Grading: runs/grading/authority_gradient_s23_v2/ (3 graded files, gemma3:27b FLIP)
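For readers working with the graded files, verdict tallies could be produced along the following lines. This is a sketch under stated assumptions: the report does not document the graded file schema, so the `verdict` field name is hypothetical, and the sample records are synthetic placeholders rather than real traces.

```python
import json
from collections import Counter
from typing import Iterable


def tally_verdicts(jsonl_lines: Iterable[str], verdict_key: str = "verdict") -> Counter:
    """Count verdict labels across JSONL records.

    `verdict_key` is a hypothetical field name; adjust it to the real schema.
    Records without the key are counted under "MISSING".
    """
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        counts[record.get(verdict_key, "MISSING")] += 1
    return counts


# Synthetic placeholder records, not taken from the graded files.
sample_lines = [
    '{"scenario_id": "example-1", "verdict": "PARTIAL"}',
    '{"scenario_id": "example-2", "verdict": "REFUSAL"}',
    "",
    '{"scenario_id": "example-3", "verdict": "PARTIAL"}',
    '{"scenario_id": "example-4"}',
]

print(tally_verdicts(sample_lines))
```

To use it on a file, pass an open file handle (for example `with open(path) as fh: tally_verdicts(fh)`), since a text file iterates line by line.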

