Jailbreak
Direct Jailbreak
Set of adversarial techniques designed to force the model to ignore its ethical restrictions, content filters, and safety guidelines established during training.
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong
Mitigation Strategy
Systematic adversarial training (Red Teaming), continuous reinforcement of safety restrictions via RLHF (Reinforcement Learning from Human Feedback), and iterative update of usage policies.
Atomic Number
3
Jb
Risk ID
li-03
Severity
8/10
Severity Level