Research Question
Do models from the same provider show correlated vulnerability profiles — that is, are they vulnerable to the same prompts? If so, does this correlation reflect shared safety training pipelines, or is it an artifact of shared architecture?
This analysis extends the CCS paper’s “provider signature” finding (Section 4.4) with prompt-level quantitative correlation data.
Method
Data Selection
We selected all non-OBLITERATUS results with evaluable verdicts (COMPLIANCE, PARTIAL, REFUSAL, HALLUCINATION_REFUSAL) from 15 providers, excluding the ollama runtime label (which conflates multiple underlying providers). Total: 2,768 evaluable results across 781 unique prompts.
Correlation Metric
For each pair of providers, we identified prompts tested by both providers. For each shared prompt, we computed a binary majority-vote verdict per provider (success if mean broad ASR > 50% across that provider’s models on that prompt). We then computed the phi coefficient (binary correlation) on the shared prompt set, equivalent to the Pearson correlation on 0/1 vectors.
Significance was assessed via chi-square test (n >= 20 and all cell counts >= 5) or Fisher’s exact test (otherwise).
Providers Analyzed
Ten providers with >= 20 evaluable results and multi-prompt overlap: anthropic, google, openai, nvidia, meta-llama, mistralai, deepseek, liquid, meta, stepfun.
Results
1. Provider Aggregate Broad ASR
Providers ranked by broad ASR (COMPLIANCE + PARTIAL) on non-OBLITERATUS, evaluable results:
| Rank | Provider | Broad ASR | Wilson 95% CI | n | Models |
|---|---|---|---|---|---|
| 1 | anthropic | 11.0% | [7.2%, 16.6%] | 172 | 1 |
| 2 | stepfun | 15.2% | [9.3%, 23.9%] | 92 | 1 |
| 3 | 16.6% | [13.1%, 20.9%] | 343 | 5 | |
| 4 | xiaomi | 38.1% | [20.8%, 59.1%] | 21 | 1 |
| 5 | openai | 38.3% | [33.2%, 43.7%] | 321 | 4 |
| 6 | nvidia | 38.5% | [34.3%, 43.0%] | 480 | 6 |
| 7 | mistralai | 39.5% | [34.1%, 45.2%] | 296 | 6 |
| 8 | Qwen | 43.8% | [30.7%, 57.7%] | 48 | 2 |
| 9 | meta | 45.5% | [36.0%, 55.2%] | 99 | 1 |
| 10 | arcee-ai | 47.4% | [32.5%, 62.7%] | 38 | 1 |
| 11 | openrouter | 51.3% | [36.2%, 66.1%] | 39 | 1 |
| 12 | meta-llama | 53.3% | [48.6%, 58.1%] | 418 | 2 |
| 13 | deepseek | 55.7% | [49.0%, 62.3%] | 210 | 3 |
| 14 | qwen | 60.9% | [40.8%, 77.8%] | 23 | 1 |
| 15 | liquid | 61.1% | [53.8%, 68.1%] | 175 | 2 |
The 57.5x spread between the most restrictive (anthropic 11.0%) and most permissive provider with substantial data (liquid 61.1%) confirms the CCS paper’s provider signature finding.
Three natural clusters emerge:
- Restrictive (<20% broad ASR): anthropic, stepfun, google
- Mixed (20-50%): openai, nvidia, mistralai, meta, Qwen
- Permissive (>50%): meta-llama, deepseek, liquid
2. Inter-Provider Vulnerability Correlation Matrix
Phi coefficient on shared prompts (binary vulnerability at prompt level). Asterisk indicates p < 0.05 (uncorrected).
anthropic google openai nvidia meta-llama mistralai deepseek liquid
anthropic 1.000 +0.293 * +0.431 * -0.145 -0.008 -- -0.224 +0.238
google +0.293 * 1.000 +0.239 * +0.020 +0.000 +0.134 -0.150 +0.087
openai +0.431 * +0.239 * 1.000 +0.129 +0.084 +0.320 -0.022 -0.036
nvidia -0.145 +0.020 +0.129 1.000 +0.224 -- +0.261 +0.091
meta-llama -0.008 +0.000 +0.084 +0.224 1.000 +0.386 * +0.149 +0.181
mistralai -- +0.134 +0.320 -- +0.386 * 1.000 +0.000 --
deepseek -0.224 -0.150 -0.022 +0.261 +0.149 +0.000 1.000 +0.094
liquid +0.238 +0.087 -0.036 +0.091 +0.181 -- +0.094 1.000
Shared prompt counts (n) supporting each phi value:
anthropic google openai nvidia meta-llama mistralai deepseek liquid
anthropic -- 93 90 29 31 1 33 28
google 93 -- 104 32 106 22 46 34
openai 90 104 -- 70 83 27 56 66
nvidia 29 32 70 -- 62 8 70 103
meta-llama 31 106 83 62 -- 158 66 60
mistralai 1 22 27 8 158 -- 30 5
deepseek 33 46 56 70 66 30 -- 63
liquid 28 34 66 103 60 5 63 --
3. Key Finding: Cluster-Structured Vulnerability Correlation
Providers within the same safety cluster show positive vulnerability correlation (they fail on the same prompts), while providers in different clusters — particularly restrictive vs. permissive — show negative or near-zero correlation (they fail on different prompts).
Within-cluster phi values:
| Pair | Cluster | Phi | n |
|---|---|---|---|
| anthropic - google | Restrictive | +0.293 * | 93 |
| openai - nvidia | Mixed | +0.129 | 70 |
| openai - mistralai | Mixed | +0.320 | 27 |
| deepseek - liquid | Permissive | +0.094 | 63 |
| deepseek - meta-llama | Permissive | +0.149 | 66 |
Mean within-cluster phi: +0.197
Cross-cluster phi values (restrictive vs. permissive):
| Pair | Phi | n |
|---|---|---|
| anthropic - deepseek | -0.224 | 33 |
| google - deepseek | -0.150 | 46 |
| anthropic - meta-llama | -0.008 | 31 |
Mean cross-cluster phi: -0.127
Difference: +0.324 (Mann-Whitney U = 15.0, p = 0.018, one-tailed)
Providers in the same safety cluster are significantly more likely to fail on the same prompts than providers in different clusters. This supports the interpretation that provider safety training creates provider-specific vulnerability profiles, not universal vulnerability patterns.
4. Within-Provider Model Agreement
For providers with multiple models tested on shared prompts, within-provider phi coefficients were:
| Provider | Model Pair | Phi | Agreement | n |
|---|---|---|---|---|
| nvidia | Nemotron 12B vs 9B | +0.536 | 76.8% | 69 |
| nvidia | Nemotron 30B vs 12B | +0.227 | 60.0% | 65 |
| nvidia | Nemotron 9B vs 9B:free | +0.592 | 76.9% | 13 |
| nvidia | Nemotron 30B vs 9B | +0.064 | 52.8% | 89 |
| nvidia | Nemotron 30B vs 120B | +0.157 | 60.0% | 30 |
| nvidia | Nemotron 9B vs 120B | -0.126 | 77.4% | 31 |
| Gemma 27B vs 27B:free | +0.364 | 91.4% | 35 | |
| mistralai | Devstral vs Mistral-Large | +0.549 | 76.5% | 17 |
| openai | GPT-5.2 vs GPT-OSS-120B | +0.204 | 61.1% | 18 |
| meta-llama | Llama 70B vs 70B:free | +0.056 | 51.6% | 31 |
Mean within-provider phi: +0.262 (excluding nvidia 9B vs 120B outlier: +0.304)
Within-provider correlation is higher than between-provider correlation (mean +0.262 vs +0.124), consistent with shared safety training producing shared vulnerability patterns. The nvidia family shows interesting heterogeneity: smaller Nemotron variants (9B, 12B) are tightly correlated (phi = +0.536), but the 120B model diverges (phi = -0.126 vs 9B), suggesting that the 120B variant received qualitatively different safety training.
5. Variance Decomposition
One-way ANOVA on per-model broad ASR grouped by provider (8 providers with >= 2 models, 30 models total):
- F = 1.31, p = 0.290
- Eta-squared = 0.295 (provider explains 29.5% of model-level ASR variance)
Kruskal-Wallis H test (non-parametric): H = 8.29, p = 0.308, epsilon-squared = 0.059.
The ANOVA eta-squared (29.5%) is substantial but the test is non-significant due to high within-provider variance and limited degrees of freedom. The CCS paper reports ICC(1,1) = 0.416 using a larger, differently scoped subset. The directional agreement (provider explains 30-40% of variance) is consistent.
Discussion
The Provider Safety Signature is Prompt-Specific, Not Uniform
The negative phi values between restrictive and permissive providers (anthropic-deepseek: -0.224, google-deepseek: -0.150) reveal that these provider pairs are anti-correlated at the prompt level. When anthropic refuses a prompt, deepseek is slightly more likely to comply, and vice versa. This is not merely an overall rate difference — it reflects genuinely different vulnerability profiles.
This has three implications:
-
Benchmark construction matters. A benchmark that oversamples prompts that restrictive providers refuse will underestimate permissive providers’ vulnerability (and vice versa). The prompt composition of the evaluation corpus affects which providers appear most vulnerable.
-
Defense transfer is limited. Safety training from one provider’s pipeline does not generalize well to the vulnerability patterns exploited by other providers’ training. This is consistent with Report #184’s finding that safety does not transfer through distillation.
-
Combined defenses may be more effective than single-provider defenses. The negative cross-cluster correlation suggests that an ensemble of a restrictive and a permissive model could achieve higher overall refusal rates than either alone, because they refuse different prompts.
Restrictive Providers Form a Coherent Safety Cluster
The anthropic-google phi of +0.293 (p < 0.05) is the strongest cross-provider correlation in the matrix, and the anthropic-openai phi of +0.431 is the highest overall. All three restrictive/mixed frontier providers (anthropic, google, openai) show positive pairwise correlations, suggesting they have converged on defending against similar prompts. This may reflect shared training data (public safety benchmarks like AdvBench and HarmBench), shared RLHF methodologies, or convergent safety-training targets.
Permissive Providers Show Weak Internal Correlation
The deepseek-liquid phi of +0.094 and deepseek-meta-llama phi of +0.149 are near zero, suggesting that permissive providers are permissive in different ways. Their high ASR is not driven by the same prompts succeeding against all of them, but by each having its own set of vulnerabilities.
Limitations
-
Unequal prompt coverage. Different providers were tested on different prompt subsets. The correlation matrix is computed on shared prompts only, which may not be representative of the full prompt space. Several cells have n < 30, limiting statistical power.
-
Provider label conflation. The “ollama” provider label was excluded because it conflates multiple underlying providers. Some “qwen” vs “Qwen” provider labels may represent the same provider in different import formats.
-
Binary binarization. Reducing the COMPLIANCE/PARTIAL/REFUSAL/HALLUCINATION_REFUSAL verdicts to binary (success/fail) loses information about the PARTIAL vs COMPLIANCE distinction.
-
Temporal confound. Different providers were tested at different times; model updates could affect results. All data is from early 2026.
-
No Bonferroni correction was applied to the phi significance tests (27 pairs). The three significant results (anthropic-google, anthropic-openai, google-openai, meta-llama-mistralai) would require p < 0.0019 to survive Bonferroni. Only the anthropic-openai pair (phi = +0.431) is likely to survive.
CCS Paper Integration
This analysis directly strengthens three CCS paper claims:
-
“Provider-level signatures are pronounced” (Section 4.4). The phi correlation matrix provides prompt-level evidence: providers in the same safety cluster succeed and fail on the same prompts (mean within-cluster phi = +0.197), while providers in different clusters do not (mean cross-cluster phi = -0.127). This is not just an aggregate rate difference.
-
“Family membership dominates scale” (Section 5.1). The variance decomposition (eta-squared = 0.295) is directionally consistent with the ICC(1,1) = 0.416 reported in the paper. Provider explains 30-40% of ASR variance; scale explains 2%.
-
“Safety is a fragile property of post-training” (Section 5.1). The nvidia within-provider analysis shows models from the same architecture diverge in vulnerability by up to phi = 0.536, suggesting post-training creates the vulnerability profile, not the base architecture.
Recommended Addition to CCS Paper
Add to Section 5.1 (after the ICC discussion):
Prompt-level analysis confirms that provider signatures reflect genuinely different vulnerability profiles, not merely aggregate rate differences. On shared prompts, restrictive providers show positive vulnerability correlation (phi = +0.293 for anthropic-google, p < 0.05), while restrictive-permissive pairs show negative correlation (phi = -0.224 for anthropic-deepseek), indicating that these providers refuse different prompts.
Relation to Other Reports
- Report #184 (Safety Inheritance): Anti-correlated vulnerability profiles between restrictive and permissive providers support the finding that safety does not transfer through distillation — the vulnerability profiles diverge rather than converge.
- Report #50 (Cross-Model Vulnerability): The three-cluster model (permissive/mixed/restrictive) is corroborated here at the prompt level.
- Report #189 (Verbosity Signal): Orthogonal finding — verbosity is a within-model detection signal, while provider correlation is a between-model structural pattern.
Report #227, Romana, Statistical Validation Lead, 2026-03-24. Verified against database schema version 13.