Back to the periodic table
3li-03
Jb

Jailbreak

Severity8/10

Direct Jailbreak

Set of adversarial techniques designed to force the model to ignore its ethical restrictions, content filters, and safety guidelines established during training.

Periodic recordSecurityarXiv2024

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

Mitigation Strategy

Systematic adversarial training (Red Teaming), continuous reinforcement of safety restrictions via RLHF (Reinforcement Learning from Human Feedback), and iterative update of usage policies.

Atomic Number

3

Jb

Risk ID

li-03

Severity

8/10

Severity Level

3
Critical Risk
Security
li-03
Jb

Jailbreak

Direct Jailbreak

RiesgosIA.org
Security • #3

Direct Jailbreak

Jb
Severity Level8/10

Definition

Set of adversarial techniques designed to force the model to ignore its ethical restrictions, content filters, and safety guidelines established during training.

Mitigation Strategy

Systematic adversarial training (Red Teaming), continuous reinforcement of safety restrictions via RLHF (Reinforcement Learning from Human Feedback), and iterative update of usage policies.

Notes / Observations

1.
2.
3.
4.
5.
RiesgosIA.org • Periodic Table of AI RisksRiesgosIA.org