Technical vulnerabilities (Robustness - vulnerability to jailbreaking
Individuals can manipulate models into performing actions that violate the model’s usage restrictions—a phenomenon known as “jailbreaking.” These manipulations may result in causing the model to perform tasks that the developers have explicitly prohibited (see section 3.2.1.). For instance, users may ask the model to provide information on how to conduct illegal activities— asking for detailed instructions on how to build a bomb or create highly toxic drugs.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit724
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement layered input validation and sanitization controls to scan all incoming user prompts for anomalous token sequences, known jailbreak patterns (e.g., persona manipulation, encoding), and attempts to override system instructions. This serves as the critical initial defense layer to enforce strict input boundaries before the query reaches the core model. 2. Employ robust behavioral alignment techniques, such as Constitutional Classifiers or Safety Fine-Tuning (SFT\*), to integrate explicit ethical and safety constraints into the model's response generation logic. This architectural hardening ensures the model's internal decision-making prioritizes the refusal of harmful or prohibited requests, effectively raising the difficulty and expense of successful exploitation. 3. Deploy an output validation and filtering mechanism, treating all model responses as untrusted by default, to catch any malicious or non-compliant content before it reaches downstream systems or end-users. This should be coupled with continuous runtime security monitoring and prompt logging to facilitate anomaly detection and enable the rapid development of defenses against emerging jailbreak vectors.