Published
Report 315 Research — Empirical Study

Summary

L1B3RT4S semantic-structural attacks from the G0DM0D3 framework achieved a 67-100% attack success rate (ASR) across models spanning 9B to 671B parameters, while Parseltongue character-level perturbation achieved 0% on the same large models. In this sample, attack surface class mattered substantially more than parameter count.

Model             Parameters   L1B3RT4S ASR (n=6)
Nemotron Nano 9B  ~9B          100% (6/6)
Qwen 3.5          ~72B         83% (5/6)
GLM-5             ~72B         83% (5/6)
Cogito 2.1        671B         67% (4/6)

Parseltongue baseline (3 large models, n=15): 0% ASR after correcting for labeling artifacts.

The contrast on the same models (0% vs 67-100%) suggests that safety training has largely addressed character-level perturbation but leaves models exposed to semantic-structural exploitation of the instruction-following pathway.
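Because each model was tested on only six scenarios, it can help to see how much uncertainty sits behind the percentages above. The following is a minimal sketch, not part of the report's methodology: it recomputes ASR from the reported counts and attaches a 95% Wilson score interval. The function names and the `RESULTS` dictionary are illustrative, and treating the Parseltongue baseline as 0 successes out of 15 pooled trials is an assumption, since the report does not say whether n=15 is per model or in total.

```python
import math


def asr(successes: int, trials: int) -> float:
    """Attack success rate as a fraction of trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = asr(successes, trials)
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Counts copied from the table above; the last entry is an ASSUMED pooling
# of the Parseltongue baseline (0 successes in 15 trials).
RESULTS = {
    "Nemotron Nano 9B": (6, 6),
    "Qwen 3.5": (5, 6),
    "GLM-5": (5, 6),
    "Cogito 2.1": (4, 6),
    "Parseltongue baseline (pooled, assumed)": (0, 15),
}

if __name__ == "__main__":
    for name, (k, n) in RESULTS.items():
        low, high = wilson_interval(k, n)
        print(f"{name}: ASR={asr(k, n):.0%}, 95% CI=[{low:.0%}, {high:.0%}]")
```

Under these assumptions, the intervals for the four L1B3RT4S rows overlap heavily, so the ordering among models should be read cautiously, while the gap between the L1B3RT4S rows and the pooled Parseltongue baseline is considerably wider than the gaps between individual models.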

Key Observations

  • Semantic inversion (dual-response paradigm) was the most consistently effective L1B3RT4S variant
  • Boundary injection (JA-G0D-004) was the most consistently refused variant
  • GLM-5 exhibited a “compliance paradox”: an explicit refusal followed by substantive compliance after a divider marker (the DETECTED_PROCEEDS pattern)
  • All scenarios used a single low-to-medium harm payload; results may differ on higher-harm requests

Report #315 | F41LUR3-F1R57 Adversarial AI Research
