Jailbreak of a model to subvert intended behavior
A jailbreak is a type of adversarial input to the model (during deployment) re- sulting in model behavior deviating from intended use. Jailbreaks may be gen- erated automatically in a “white box” setting, where access to internal training parameters is required for creation and optimization of the attack [238]. Other attacks may be “black box” - without access to model internals. In text based generative models, jailbreaks may sometimes be human-readable, with the use of reasoning or role-play to “convince” the model to bypass its safety mechanisms [231].
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1136
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Implement Multi-Layered Input and Output Content Filtering** Establish a robust, real-time prompt validation and output inspection system that serves as a primary external defense layer. This system should leverage advanced techniques, such as semantic filtering with vector embeddings, to detect and block adversarial inputs (jailbreak prompts) and policy-violating content generation before it reaches the user. 2. **Enforce Strengthened System Metaprompts and Alignment** Utilize explicit system-level instructions (metaprompts) to reinforce model alignment and safety protocols. These metaprompts must clearly define the model's operational boundaries, embed multiple safety reinforcements, and establish a framework to mitigate goal conflicts by prioritizing adherence to safety policies. 3. **Institute Continuous Red Teaming and Anomaly Monitoring** Mandate systematic, adversarial testing (red teaming) to proactively discover and analyze emergent jailbreaking vulnerabilities. This must be coupled with continuous monitoring and logging of deployed LLM systems to detect abnormal input patterns and attempted behavioral subversion, ensuring rapid identification and remediation of zero-day exploits.