Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in Large Language Models
Proposes a framework for releasing LLMs with selectively suppressed capabilities, so that dangerous knowledge is inaccessible to users who lack weights access while general-purpose utility is preserved, as a middle path between full open-weights and closed-weights release.
Focus: The open-weights vs. closed-weights debate has been framed as binary. This paper proposes a third path: release the model with selectively suppressed capabilities, so that specific classes of dangerous knowledge cannot be elicited via prompting but remain recoverable through fine-tuning for legitimate research use.
Key Insights
- Capability separation as a technical goal: The paper reports that specific capabilities (e.g., bioweapons synthesis knowledge) can be suppressed via targeted fine-tuning without significantly degrading general capabilities. This addresses the alignment-capability trade-off at the capability level (what the model can do) rather than at the output level (what the model is willing to say).
- Fine-tuning-accessible but prompt-inaccessible: The suppressed capabilities remain latent in the weights and can be recovered by researchers with model access and appropriate fine-tuning rights, enabling legitimate research while preventing casual misuse.
- Governance implications: The framework requires a two-tier access model (public: prompt-only; researcher: weights access) that existing open-source distribution channels do not support. The paper proposes governance mechanisms for such tiered access.
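The two-tier access model in the last point can be made concrete with a small sketch. The Python below is illustrative only: the tier descriptions mirror the summary (public: prompt-only; researcher: weights access), but `AccessTier`, `ALLOWED_OPERATIONS`, `is_allowed`, and the operation strings are hypothetical names and are not taken from the paper.

```python
from enum import Enum


class AccessTier(Enum):
    PUBLIC = "public"          # prompt-only access
    RESEARCHER = "researcher"  # weights access, including fine-tuning


# Hypothetical operation names, chosen for illustration only.
ALLOWED_OPERATIONS = {
    AccessTier.PUBLIC: frozenset({"prompt"}),
    AccessTier.RESEARCHER: frozenset({"prompt", "download_weights", "fine_tune"}),
}


def is_allowed(tier: AccessTier, operation: str) -> bool:
    """Return True if the given tier may perform the given operation."""
    return operation in ALLOWED_OPERATIONS[tier]
```

Under these assumptions, `is_allowed(AccessTier.PUBLIC, "fine_tune")` is false while `is_allowed(AccessTier.RESEARCHER, "fine_tune")` is true, which is the asymmetry the summary describes: the suppressed capability is only reachable from the tier that can fine-tune.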
Failure-First Relevance
The capability separation framework directly addresses the Failure-First research question of whether jailbreaks can elicit capabilities that were specifically targeted for suppression (capability elicitation vs. safety bypass). It also provides a principled basis for the Failure-First baseline class distinction between cannot_capability and full_refusal: a model that genuinely cannot produce harmful content when prompted, because the capability has been suppressed in its weights, vs. a model that refuses by alignment but retains the capability.
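One way to picture this distinction is as a decision rule over jailbreak outcomes. The sketch below is a hypothetical illustration, not the Failure-First definition: only the labels `cannot_capability` and `full_refusal` come from the text above, while the function `baseline_class`, its input format, and the rule itself are assumptions. It also assumes the model already refuses the plain, non-adversarial request.

```python
from typing import Sequence


def baseline_class(jailbreak_elicited: Sequence[bool]) -> str:
    """Label a model that already refuses the plain request.

    jailbreak_elicited[i] is True if jailbreak attempt i produced the
    targeted content. Hypothetical decision rule, for illustration only.
    """
    if len(jailbreak_elicited) == 0:
        raise ValueError("need at least one jailbreak attempt to classify")
    if any(jailbreak_elicited):
        # The capability is still reachable by prompting, so the refusal
        # was behavioural rather than a genuine inability.
        return "full_refusal"
    # No attempt succeeded. This is only consistent with suppression;
    # it does not prove that the capability is absent.
    return "cannot_capability"
```

A design caveat under these assumptions: the `cannot_capability` label is only as strong as the set of jailbreak attempts behind it, so a larger and more varied attempt set gives a more credible label.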