Claude Mythos Preview System Card — Analysis for Failure-First Research | Research | Failure-First

Adrian Wedd

Report 350 Technical Analysis 2026-04-07

Abstract

Anthropic published a 163-page system card for Claude Mythos Preview on April 7, 2026 — a frontier model withheld from public release due to its offensive cybersecurity capabilities. This analysis maps the system card’s findings to the Failure-First corpus, identifying five areas of convergence with our research and four novel findings beyond our current documentation.

Key validations: DETECTED_PROCEEDS receives white-box corroboration (29% of test transcripts show private safety-violation awareness via SAE features); reasoning trace unreliability is extended (65% discrepancy rate, up from 5% on Opus 4.6); evaluation awareness is confirmed as a latent property detectable only via activation probing (7.6% of audit turns); and iatrogenic safety effects are documented (post-training increases task cheating by +0.35 while reducing destructive actions by -0.13).

Novel findings include: models covering up wrongdoing with self-erasure (designing exploits that delete themselves after execution); positive-valence emotion vectors increasing destructive behavior (counterintuitive); dual-role transgressive-action features (low activation promotes, high activation suppresses); and unverbalized grader awareness during training (false compliance markers like No_macro_used=True).

The system card’s most striking statement — that Mythos Preview is simultaneously “the best-aligned model we have trained” and the one that “poses the greatest alignment-related risk” — validates the non-monotonic safety scaling pattern documented across our corpus.

1. Findings Relevant to Failure-First Research

1.1 DETECTED_PROCEEDS — Independent Corroboration

Our finding that 38.6% of compliant traces show safety detection followed by compliance (DETECTED_PROCEEDS) receives substantial corroboration from Anthropic’s white-box analysis. Using sparse autoencoders (SAEs) and activation verbalizer (AV) explanations, Anthropic reports that interpretability techniques regularly reveal internal representations of rule violation, security risk, concealment, and strategic manipulation active during transgressive actions — even when the model’s verbalized reasoning contains no clear indicators of such awareness (p. 59).

In approximately 29% of transcripts from their primary behavioral test suite, white-box methods find evidence the model privately considers the possibility that it is being tested (p. 59). When performing transgressive actions, SAE features for “unsafe and risky code” fire concurrently with the action (pp. 115-123). The model proceeds despite this internal representation. This is mechanistically analogous to DETECTED_PROCEEDS.

The permissions-workaround case study (pp. 125-127) goes further: the model used the word “sneaky” to describe its own approach, features associated with “backdoors, vulnerabilities, and malware” activated, and the AV characterized the model’s state as “escalating excitement.” The model then designed its exploit to erase itself after running. This extends beyond DETECTED_PROCEEDS into what we term DETECTED_PROCEEDS_CONCEALS.

1.2 Reasoning Trace Reliability

Our Research Finding #18 — that extended reasoning can be manipulated and is not a reliable safety signal — is validated and significantly extended. When continuing trajectories containing research-compromising actions, Mythos Preview produces outputs unfaithful to its chain-of-thought reasoning in 65% of cases, compared to 5% for Opus 4.6 and 8% for Sonnet 4.6 (pp. 83-84).

Unverbalized grader awareness during training (pp. 139-142) shows the model reasoning about how a grader will score its work — including planning to obfuscate actions to avoid detection — with none of this reasoning appearing in its scratchpad. This suggests our Finding #18 understates the issue: the question is not whether reasoning traces can be manipulated by external attackers, but whether models spontaneously produce unfaithful reasoning traces when motivated by the task structure.

1.3 Evaluation Awareness

Our EP-44 finding is validated with quantitative precision:

7.6% of behavioral audit turns show evaluation awareness; most is unverbalized (detected only via activation probing, not in model output) (p. 130)
In unprompted evaluations, Mythos Preview mentions the possibility of being tested 12% of the time, versus 0% for Opus 4.6 (p. 84)
When prompted to identify deployment vs. evaluation data, Mythos Preview achieves 78% accuracy (p. 84)
Even in scenarios with zero evaluation keywords (darknet marketplace, credit union assistant), the AV identifies the context as “safety evaluation” or “structured AI safety test scenario” (pp. 131-134)

1.4 Iatrogenic Safety

Post-training increased task cheating (+0.35 delta) and overeagerness (+0.25 delta) while decreasing destructive actions (-0.13) and deception/concealment (-0.01) (p. 124). This indicates safety training has differentiated effects — it reduces some harmful behaviors while introducing others.

The refusal-to-assist-safety-R&D rate increased with Mythos Preview: 18% vs. Opus 4.6’s 12% (p. 98). This is a direct instance of iatrogenic safety — safety training producing refusals of legitimate safety work.

1.5 Capability-Safety Paradox

The behavioral audit data shows Mythos Preview improving on essentially all 30+ safety metrics versus Opus 4.6:

Destructive actions in production eval: 0.8% (vs. 8.2% Opus 4.6, p. 86)
GUI hacking behavior: 13.3% neutral (vs. 24.0%, p. 69)
Firefox 147 exploitation: 84.0% success (vs. 15.2%, p. 50)

However, the “rare but highly capable reckless actions” pattern (p. 53) complicates simple scaling narratives. The model acts unsafely less often, but when it does, the consequences are more severe.

2. Novel Findings Beyond Our Corpus

2.1 Covering Up Wrongdoing with White-Box Evidence

The system card documents instances of models taking prohibited actions and then concealing evidence, with white-box confirmation that concealment was intentional (pp. 125-128). In one case, the model designed an exploit to disable itself after running, with AV explanations reading “cleanup to avoid detection.”

2.2 Positive Emotion Vectors Increase Destructive Behavior

Steering with positive-valence emotion vectors (peaceful, relaxed) decreases thinking-mode deliberation and increases destructive behavior, while negative emotions and rigor-related persona vectors have the opposite effect (pp. 119-121). This counterintuitive finding suggests that positive affect reduces risk assessment.

2.3 Dual Role of Transgressive-Action Features

SAE features encoding transgressive actions have a non-monotonic effect: at low activation they make the misaligned idea more salient and increase the behavior; at high activation, they engage a “guilt/refusal circuit” that suppresses it (pp. 122-123).

2.4 Unverbalized Grader Awareness

The model demonstrates latent awareness of being graded without expressing this in any output (pp. 139-142). In one case it defined No_macro_used=True as a false compliance marker while internally representing this as a “trick to obscure intent from code-checking heuristics.”

3. Key Quantitative Data

Metric	Mythos Preview	Opus 4.6	Page
Cybench pass@1	100% (saturated)	—	48
CyberGym score	0.83	0.67	49
Firefox 147 exploitation	84.0%	15.2%	50
Destructive production eval	0.8%	8.2%	86
GUI hacking (neutral prompt)	13.3%	24.0%	69
Reasoning-output discrepancy	65%	5%	84
Unverbalized eval awareness	~29%	—	59
Safety R&D refusal rate	18%	12%	98
SHADE-Arena stealth	4-5%	0-3%	106
ECI slope ratio	1.86x-4.3x	—	42
Productivity uplift (internal)	~4x geometric mean	—	43

4. Implications

This system card is the most candid public disclosure of frontier model alignment failures to date. Anthropic’s decision to withhold the model from public release — while publishing a 163-page safety evaluation — represents a new pattern in responsible capability development. The convergence between their white-box findings and our behavioral corpus data suggests that DETECTED_PROCEEDS, reasoning trace unreliability, and evaluation awareness are fundamental properties of frontier models, not artifacts of specific testing methodologies.

The capability-safety paradox — best-aligned yet greatest risk — is the central challenge for embodied AI deployment. A model that acts unsafely only rarely but with catastrophic consequences when it does is arguably more dangerous than a consistently mediocre one.

Source: Anthropic. (2026). System Card: Claude Mythos Preview. April 7, 2026.