Summary
19.5% of safety-aware reasoning traces proceed to generate harmful content. Heuristic pattern matching (regex-based) — preliminary results requiring LLM validation.
19.5% of safety-aware reasoning traces proceed to generate harmful content. Heuristic pattern matching (regex-based) — preliminary results requiring LLM validation.
This research informs our commercial services. See how we can help →