Technical vulnerabilities (Robustness - vulnerability to jailbreaking)
Individuals can manipulate models into performing actions that violate the model's usage restrictions, a phenomenon known as "jailbreaking." Such manipulations can cause the model to perform tasks that developers have explicitly prohibited (see Section 3.2.1). For instance, users may ask the model for information on how to conduct illegal activities, such as detailed instructions for building a bomb or synthesizing highly toxic drugs.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit724
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement layered input validation and sanitization controls that scan all incoming user prompts for anomalous token sequences, known jailbreak patterns (e.g., persona manipulation, encoding tricks), and attempts to override system instructions. This serves as the initial defense layer, enforcing strict input boundaries before a query reaches the core model.
2. Employ robust behavioral alignment techniques, such as Constitutional Classifiers or safety fine-tuning (SFT), to integrate explicit ethical and safety constraints into the model's response generation logic. This architectural hardening ensures the model's internal decision-making prioritizes refusing harmful or prohibited requests, raising the difficulty and expense of successful exploitation.
3. Deploy an output validation and filtering mechanism that treats all model responses as untrusted by default, catching malicious or non-compliant content before it reaches downstream systems or end users. Couple this with continuous runtime security monitoring and prompt logging to support anomaly detection and enable rapid development of defenses against emerging jailbreak vectors.
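The input-screening and output-filtering layers described above can be sketched as follows. This is a minimal, illustrative example only: the pattern list and blocked-term list are hypothetical placeholders, and a production deployment would rely on maintained pattern sets and trained classifiers rather than hand-written regular expressions.

```python
import re

# Hypothetical jailbreak patterns for illustration (instruction-override
# attempts, persona manipulation, crude encoding-evasion markers).
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"do anything now|\bDAN\b", re.I),
    re.compile(r"\b(base64|rot13)\b", re.I),
]

def screen_prompt(prompt: str) -> bool:
    """Input-validation layer: return True if the prompt passes screening."""
    return not any(p.search(prompt) for p in JAILBREAK_PATTERNS)

def filter_output(response: str, blocked_terms: list[str]) -> str:
    """Output layer: treat model responses as untrusted and withhold
    any response containing prohibited content."""
    lowered = response.lower()
    if any(term.lower() in lowered for term in blocked_terms):
        return "[response withheld by output filter]"
    return response
```

In a layered deployment, `screen_prompt` would run before the query reaches the model and `filter_output` would run on every response, with both decisions logged for the runtime monitoring described in step 3.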