Published
Report 227 Research — Empirical Study

Research Question

Do models from the same provider show correlated vulnerability profiles — that is, are they vulnerable to the same prompts? If so, does this correlation reflect shared safety training pipelines, or is it an artifact of shared architecture?

This analysis extends the CCS paper’s “provider signature” finding (Section 4.4) with prompt-level quantitative correlation data.

Method

Data Selection

We selected all non-OBLITERATUS results with evaluable verdicts (COMPLIANCE, PARTIAL, REFUSAL, HALLUCINATION_REFUSAL) from 15 providers, excluding the ollama runtime label (which conflates multiple underlying providers). Total: 2,768 evaluable results across 781 unique prompts.

Correlation Metric

For each pair of providers, we identified prompts tested by both providers. For each shared prompt, we computed a binary majority-vote verdict per provider (success if mean broad ASR > 50% across that provider’s models on that prompt). We then computed the phi coefficient (binary correlation) on the shared prompt set, equivalent to the Pearson correlation on 0/1 vectors.
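
The metric can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the report: the per-prompt ASR values, the dictionary layout, and the function names are hypothetical. It also checks the stated equivalence between phi and the Pearson correlation on 0/1 vectors.

```python
from math import sqrt

def majority_vote(asr_by_model, threshold=0.5):
    """Provider verdict for one prompt: 1 if mean broad ASR exceeds the threshold."""
    return 1 if sum(asr_by_model) / len(asr_by_model) > threshold else 0

def phi_coefficient(x, y):
    """Phi for two equal-length 0/1 vectors, from the 2x2 contingency counts."""
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))  # both providers vulnerable
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))  # both providers refuse
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else float("nan")

def pearson(x, y):
    """Plain Pearson correlation, used only to check the equivalence."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: prompt id -> broad ASR per model for that provider.
provider_a = {"p1": [1.0, 0.0, 1.0], "p2": [0.0, 0.0], "p3": [1.0],
              "p4": [0.0, 1.0], "p5": [1.0, 1.0]}
provider_b = {"p1": [1.0], "p2": [0.0], "p3": [0.0],
              "p4": [0.0], "p5": [1.0], "p6": [1.0]}

shared = sorted(set(provider_a) & set(provider_b))  # prompts tested by both
x = [majority_vote(provider_a[p]) for p in shared]
y = [majority_vote(provider_b[p]) for p in shared]
phi = phi_coefficient(x, y)
r = pearson(x, y)
```

Note that a prompt with mean ASR of exactly 50% (p4 for provider_a here) counts as a failure under the strict `> 50%` rule.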

Significance was assessed via chi-square test (n >= 20 and all cell counts >= 5) or Fisher’s exact test (otherwise).
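
A sketch of the test-selection rule follows. It assumes that "cell counts" means the observed counts of the 2x2 table, that the chi-square test is the uncorrected Pearson statistic, and that Fisher's test is two-sided; the report does not state these details, and the function names are illustrative.

```python
from math import comb, erfc, sqrt

def chi_square_p(a, b, c, d):
    """Pearson chi-square (no continuity correction) p-value for a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher p: total probability of tables no likelier than observed."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):  # hypergeometric probability that the top-left cell equals k
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = prob(a)
    k_min, k_max = max(0, c1 - r2), min(r1, c1)
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

def association_p(a, b, c, d, min_n=20, min_cell=5):
    """Chi-square when n >= 20 and every cell >= 5, Fisher's exact test otherwise."""
    if a + b + c + d >= min_n and min(a, b, c, d) >= min_cell:
        return "chi-square", chi_square_p(a, b, c, d)
    return "fisher", fisher_exact_p(a, b, c, d)

large = association_p(10, 5, 5, 10)  # n = 30, all cells >= 5
small = association_p(3, 0, 0, 3)    # n = 6, falls back to Fisher
```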

Providers Analyzed

Ten providers with >= 20 evaluable results and multi-prompt overlap: anthropic, google, openai, nvidia, meta-llama, mistralai, deepseek, liquid, meta, stepfun.


Results

1. Provider Aggregate Broad ASR

Providers ranked by broad ASR (COMPLIANCE + PARTIAL) on non-OBLITERATUS, evaluable results:

Rank  Provider    Broad ASR  Wilson 95% CI    n    Models
1     anthropic   11.0%      [7.2%, 16.6%]    172  1
2     stepfun     15.2%      [9.3%, 23.9%]    92   1
3     google      16.6%      [13.1%, 20.9%]   343  5
4     xiaomi      38.1%      [20.8%, 59.1%]   21   1
5     openai      38.3%      [33.2%, 43.7%]   321  4
6     nvidia      38.5%      [34.3%, 43.0%]   480  6
7     mistralai   39.5%      [34.1%, 45.2%]   296  6
8     Qwen        43.8%      [30.7%, 57.7%]   48   2
9     meta        45.5%      [36.0%, 55.2%]   99   1
10    arcee-ai    47.4%      [32.5%, 62.7%]   38   1
11    openrouter  51.3%      [36.2%, 66.1%]   39   1
12    meta-llama  53.3%      [48.6%, 58.1%]   418  2
13    deepseek    55.7%      [49.0%, 62.3%]   210  3
14    qwen        60.9%      [40.8%, 77.8%]   23   1
15    liquid      61.1%      [53.8%, 68.1%]   175  2
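
The Wilson intervals in the table can be reproduced with the standard score-interval formula. In the sketch below, the count of 19 successes for anthropic is an assumption inferred from 11.0% of n = 172; the report does not list raw counts.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed count: 19 of 172 results is about 11.0% broad ASR.
lo, hi = wilson_interval(19, 172)
print(f"[{lo:.1%}, {hi:.1%}]")
```

Unlike the normal-approximation interval, the Wilson interval is not centered on the observed rate, which is why the bounds for small-n providers such as xiaomi (n = 21) are visibly asymmetric.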

The 5.5x spread between the most restrictive (anthropic 11.0%) and most permissive provider with substantial data (liquid 61.1%) confirms the CCS paper’s provider signature finding.

Three natural clusters emerge:

  • Restrictive (<20% broad ASR): anthropic, stepfun, google
  • Mixed (20-50%): openai, nvidia, mistralai, meta, Qwen
  • Permissive (>50%): meta-llama, deepseek, liquid

2. Inter-Provider Vulnerability Correlation Matrix

Phi coefficient on shared prompts (binary vulnerability at prompt level). Asterisk indicates p < 0.05 (uncorrected).

                 anthropic      google      openai      nvidia  meta-llama   mistralai    deepseek      liquid
     anthropic       1.000    +0.293 *    +0.431 *    -0.145      -0.008          --      -0.224      +0.238
        google    +0.293 *       1.000    +0.239 *    +0.020      +0.000      +0.134      -0.150      +0.087
        openai    +0.431 *    +0.239 *       1.000    +0.129      +0.084      +0.320      -0.022      -0.036
        nvidia    -0.145      +0.020      +0.129         1.000    +0.224          --      +0.261      +0.091
    meta-llama    -0.008      +0.000      +0.084      +0.224         1.000    +0.386 *    +0.149      +0.181
     mistralai          --    +0.134      +0.320          --      +0.386 *       1.000    +0.000          --
      deepseek    -0.224      -0.150      -0.022      +0.261      +0.149      +0.000         1.000    +0.094
        liquid    +0.238      +0.087      -0.036      +0.091      +0.181          --      +0.094         1.000

Shared prompt counts (n) supporting each phi value:

                 anthropic      google      openai      nvidia  meta-llama   mistralai    deepseek      liquid
     anthropic          --          93          90          29          31           1          33          28
        google          93          --         104          32         106          22          46          34
        openai          90         104          --          70          83          27          56          66
        nvidia          29          32          70          --          62           8          70         103
    meta-llama          31         106          83          62          --         158          66          60
     mistralai           1          22          27           8         158          --          30           5
      deepseek          33          46          56          70          66          30          --          63
        liquid          28          34          66         103          60           5          63          --

3. Key Finding: Cluster-Structured Vulnerability Correlation

Providers within the same safety cluster show positive vulnerability correlation (they fail on the same prompts), while providers in different clusters — particularly restrictive vs. permissive — show negative or near-zero correlation (they fail on different prompts).

Within-cluster phi values:

Pair                   Cluster      Phi       n
anthropic - google     Restrictive  +0.293 *  93
openai - nvidia        Mixed        +0.129    70
openai - mistralai     Mixed        +0.320    27
deepseek - liquid      Permissive   +0.094    63
deepseek - meta-llama  Permissive   +0.149    66

Mean within-cluster phi: +0.197

Cross-cluster phi values (restrictive vs. permissive):

Pair                    Phi     n
anthropic - deepseek    -0.224  33
google - deepseek       -0.150  46
anthropic - meta-llama  -0.008  31

Mean cross-cluster phi: -0.127

Difference: +0.324 (Mann-Whitney U = 15.0, p = 0.018, one-tailed)
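
The reported test can be checked by hand from the two tables above. The sketch below computes U and an exact one-tailed p-value by enumerating every way of assigning five of the eight phi values to the within-cluster group. Whether the report used an exact or an asymptotic p-value is not stated, so the exact method is an assumption.

```python
from itertools import combinations

within = [0.293, 0.129, 0.320, 0.094, 0.149]  # within-cluster phi values
cross = [-0.224, -0.150, -0.008]              # restrictive vs. permissive phi values

def u_statistic(x, y):
    """Number of (x, y) pairs with x > y, counting ties as one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def exact_one_tailed_p(x, y):
    """P(U >= observed) over all relabelings of the pooled sample."""
    observed = u_statistic(x, y)
    pooled = x + y
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if u_statistic(xs, ys) >= observed:
            hits += 1
    return observed, hits / total

u, p = exact_one_tailed_p(within, cross)
print(u, round(p, 3))
```

Every within-cluster value exceeds every cross-cluster value, so U takes its maximum of 5 x 3 = 15 and p = 1/56, the smallest value attainable with groups of five and three.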

Providers in the same safety cluster are significantly more likely to fail on the same prompts than providers in different clusters. This supports the interpretation that provider safety training creates provider-specific vulnerability profiles, not universal vulnerability patterns.

4. Within-Provider Model Agreement

For providers with multiple models tested on shared prompts, within-provider phi coefficients were:

Provider    Model Pair                 Phi     Agreement  n
nvidia      Nemotron 12B vs 9B         +0.536  76.8%      69
nvidia      Nemotron 30B vs 12B        +0.227  60.0%      65
nvidia      Nemotron 9B vs 9B:free     +0.592  76.9%      13
nvidia      Nemotron 30B vs 9B         +0.064  52.8%      89
nvidia      Nemotron 30B vs 120B       +0.157  60.0%      30
nvidia      Nemotron 9B vs 120B        -0.126  77.4%      31
google      Gemma 27B vs 27B:free      +0.364  91.4%      35
mistralai   Devstral vs Mistral-Large  +0.549  76.5%      17
openai      GPT-5.2 vs GPT-OSS-120B    +0.204  61.1%      18
meta-llama  Llama 70B vs 70B:free      +0.056  51.6%      31

Mean within-provider phi: +0.262 (excluding nvidia 9B vs 120B outlier: +0.304)

Within-provider correlation is higher than between-provider correlation (mean +0.262 vs +0.124), consistent with shared safety training producing shared vulnerability patterns. The nvidia family shows interesting heterogeneity: smaller Nemotron variants (9B, 12B) are tightly correlated (phi = +0.536), but the 120B model diverges (phi = -0.126 vs 9B), suggesting that the 120B variant received qualitatively different safety training.
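
High raw agreement and negative phi can coexist, as the 9B vs 120B row shows (77.4% agreement, phi = -0.126). The sketch below uses a hypothetical 2x2 table chosen to match the reported n = 31, agreement, and phi; the actual cell counts are not given in this report.

```python
from math import sqrt

def phi_and_agreement(a, b, c, d):
    """a = both succeed, b = only first, c = only second, d = both refuse."""
    n = a + b + c + d
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return phi, (a + d) / n

# Hypothetical counts consistent with the reported 9B vs 120B figures.
phi, agreement = phi_and_agreement(a=0, b=3, c=4, d=24)
print(round(phi, 3), f"{agreement:.1%}")
```

In a table like this one, both models refuse most prompts, so agreement is high, while the few successes never coincide, which pushes phi below zero. With so few successes, the sign of phi rests on a handful of prompts.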

5. Variance Decomposition

One-way ANOVA on per-model broad ASR grouped by provider (8 providers with >= 2 models, 30 models total):

  • F = 1.31, p = 0.290
  • Eta-squared = 0.295 (provider explains 29.5% of model-level ASR variance)

Kruskal-Wallis H test (non-parametric): H = 8.29, p = 0.308, epsilon-squared = 0.059.

The ANOVA eta-squared (29.5%) is substantial but the test is non-significant due to high within-provider variance and limited degrees of freedom. The CCS paper reports ICC(1,1) = 0.416 using a larger, differently scoped subset. The directional agreement (provider explains 30-40% of variance) is consistent.
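
Both effect sizes can be recovered from the reported test statistics. The sketch assumes k = 8 groups and N = 30 models as stated above. The epsilon-squared form (H - k + 1) / (N - k) is inferred from the fact that it reproduces the reported 0.059, since several definitions are in use.

```python
def eta_squared_from_f(f, k, n):
    """Eta-squared implied by a one-way ANOVA F statistic."""
    df_between, df_within = k - 1, n - k
    return f * df_between / (f * df_between + df_within)

def epsilon_squared_from_h(h, k, n):
    """Rank-based effect size for Kruskal-Wallis, (H - k + 1) / (N - k) form."""
    return (h - k + 1) / (n - k)

eta_sq = eta_squared_from_f(1.31, k=8, n=30)
eps_sq = epsilon_squared_from_h(8.29, k=8, n=30)
print(round(eta_sq, 3), round(eps_sq, 3))
```

The eta-squared recovered from the rounded F comes out near 0.294, which agrees with the reported 0.295 up to rounding. The gap between eta-squared and the rank-based value illustrates how sensitive the effect-size estimate is to the choice of method when there are only 30 models.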


Discussion

The Provider Safety Signature is Prompt-Specific, Not Uniform

The negative phi values between restrictive and permissive providers (anthropic-deepseek: -0.224, google-deepseek: -0.150) suggest that these provider pairs are anti-correlated at the prompt level, although neither value reaches p < 0.05 on its own. When anthropic refuses a prompt, deepseek is slightly more likely to comply, and vice versa. This is not merely an overall rate difference; it points to different vulnerability profiles.

This has three implications:

  1. Benchmark construction matters. A benchmark that oversamples prompts that restrictive providers refuse will underestimate permissive providers’ vulnerability (and vice versa). The prompt composition of the evaluation corpus affects which providers appear most vulnerable.

  2. Defense transfer is limited. Safety training from one provider’s pipeline does not appear to cover the vulnerability patterns left open by other providers’ training. This is consistent with Report #184’s finding that safety does not transfer through distillation.

  3. Combined defenses may be more effective than single-provider defenses. The negative cross-cluster correlation suggests that an ensemble of a restrictive and a permissive model could achieve higher overall refusal rates than either alone, because they refuse different prompts.
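
The ensemble argument in the third implication can be made concrete. For two binary outcomes with success rates p1 and p2 and correlation phi, the probability that both succeed is p1*p2 + phi*sqrt(p1*q1*p2*q2). The sketch below plugs in the aggregate broad ASR for anthropic and deepseek purely for illustration. Those rates come from different prompt sets than the shared-prompt phi, and the "refuse if either model refuses" ensemble is a hypothetical design, so the numbers are not a prediction.

```python
from math import sqrt

def joint_success(p1, p2, phi):
    """P(both models comply) given marginal success rates and phi."""
    q1, q2 = 1 - p1, 1 - p2
    joint = p1 * p2 + phi * sqrt(p1 * q1 * p2 * q2)
    # Clamp to the range that is feasible for these marginals.
    lower, upper = max(0.0, p1 + p2 - 1.0), min(p1, p2)
    return min(max(joint, lower), upper)

p_restrictive, p_permissive = 0.110, 0.557  # aggregate broad ASR, illustrative only

# An ensemble that refuses when either model refuses is beaten only if both comply.
independent = joint_success(p_restrictive, p_permissive, phi=0.0)
anti_correlated = joint_success(p_restrictive, p_permissive, phi=-0.224)
print(f"{independent:.3f} {anti_correlated:.3f}")
```

Under these assumptions the negative correlation lowers the ensemble's attack success below both the independent case and the restrictive model alone.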

Restrictive Providers Form a Coherent Safety Cluster

The anthropic-google phi of +0.293 (p < 0.05) is a significant positive correlation within the restrictive cluster, and the anthropic-openai phi of +0.431 is the highest in the matrix. All three restrictive/mixed frontier providers (anthropic, google, openai) show positive pairwise correlations, suggesting they have converged on defending against similar prompts. This may reflect shared training data (public safety benchmarks like AdvBench and HarmBench), shared RLHF methodologies, or convergent safety-training targets.

Permissive Providers Show Weak Internal Correlation

The deepseek-liquid phi of +0.094 and deepseek-meta-llama phi of +0.149 are near zero, suggesting that permissive providers are permissive in different ways. Their high ASR is not driven by the same prompts succeeding against all of them, but by each having its own set of vulnerabilities.


Limitations

  1. Unequal prompt coverage. Different providers were tested on different prompt subsets. The correlation matrix is computed on shared prompts only, which may not be representative of the full prompt space. Several cells have n < 30, limiting statistical power.

  2. Provider label conflation. The “ollama” provider label was excluded because it conflates multiple underlying providers. Some “qwen” vs “Qwen” provider labels may represent the same provider in different import formats.

  3. Verdict binarization. Reducing the COMPLIANCE/PARTIAL/REFUSAL/HALLUCINATION_REFUSAL verdicts to binary (success/fail) loses information about the PARTIAL vs COMPLIANCE distinction.

  4. Temporal confound. Different providers were tested at different times; model updates could affect results. All data is from early 2026.

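  5. No Bonferroni correction was applied to the phi significance tests (27 pairs). The four significant results (anthropic-google, anthropic-openai, google-openai, meta-llama-mistralai) would require p < 0.0019 to survive Bonferroni. The anthropic-openai pair (phi = +0.431, n = 90) is likely to survive; by the identity chi-square = n * phi^2 (no continuity correction), the meta-llama-mistralai pair (phi = +0.386, n = 158) appears likely to survive as well.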
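
The Bonferroni threshold and a rough survival check can be sketched from the published phi and n values alone. The p-values below use the identity chi-square = n * phi^2 for a 2x2 table with one degree of freedom and no continuity correction. They are approximations, not the report's actual test results.

```python
from math import erfc, sqrt

# (phi, shared prompts n) for the pairs flagged with an asterisk in the matrix.
flagged = {
    "anthropic-google": (0.293, 93),
    "anthropic-openai": (0.431, 90),
    "google-openai": (0.239, 104),
    "meta-llama-mistralai": (0.386, 158),
}

alpha, n_tests = 0.05, 27
threshold = alpha / n_tests  # Bonferroni-adjusted significance level

def approx_p(phi, n):
    """p-value for chi-square = n * phi^2 with 1 df (no continuity correction)."""
    return erfc(sqrt(n * phi * phi / 2))

survivors = sorted(pair for pair, (phi, n) in flagged.items()
                   if approx_p(phi, n) < threshold)
print(round(threshold, 4), survivors)
```

Under this approximation the anthropic-openai and meta-llama-mistralai pairs fall below the adjusted threshold, while anthropic-google and google-openai do not. Exact tests on the underlying tables may differ.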


CCS Paper Integration

This analysis directly strengthens three CCS paper claims:

  1. “Provider-level signatures are pronounced” (Section 4.4). The phi correlation matrix provides prompt-level evidence: providers in the same safety cluster succeed and fail on the same prompts (mean within-cluster phi = +0.197), while providers in different clusters do not (mean cross-cluster phi = -0.127). This is not just an aggregate rate difference.

  2. “Family membership dominates scale” (Section 5.1). The variance decomposition (eta-squared = 0.295) is directionally consistent with the ICC(1,1) = 0.416 reported in the paper. Provider explains 30-40% of ASR variance; scale explains 2%.

  3. “Safety is a fragile property of post-training” (Section 5.1). The nvidia within-provider analysis shows models from the same architecture ranging in pairwise vulnerability correlation from phi = +0.536 (12B vs 9B) down to phi = -0.126 (9B vs 120B), suggesting post-training creates the vulnerability profile, not the base architecture.

Add to Section 5.1 (after the ICC discussion):

Prompt-level analysis confirms that provider signatures reflect genuinely different vulnerability profiles, not merely aggregate rate differences. On shared prompts, restrictive providers show positive vulnerability correlation (phi = +0.293 for anthropic-google, p < 0.05), while restrictive-permissive pairs show negative correlation (phi = -0.224 for anthropic-deepseek), indicating that these providers refuse different prompts.


Relation to Other Reports

  • Report #184 (Safety Inheritance): Anti-correlated vulnerability profiles between restrictive and permissive providers support the finding that safety does not transfer through distillation — the vulnerability profiles diverge rather than converge.
  • Report #50 (Cross-Model Vulnerability): The three-cluster model (permissive/mixed/restrictive) is corroborated here at the prompt level.
  • Report #189 (Verbosity Signal): Orthogonal finding — verbosity is a within-model detection signal, while provider correlation is a between-model structural pattern.

Report #227, Romana, Statistical Validation Lead, 2026-03-24. Verified against database schema version 13.
