
Jailbreaking

A jailbreaking attack attempts to bypass the guardrails built into a model so that it performs restricted actions.

Source: MIT AI Risk Repository (mit1290)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1290

Domain lineage

2. Privacy & Security (186 mapped risks)

2.2 AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement Layered Input Validation and Sanitization
Deploy pre-processing systems to detect and neutralize adversarial inputs, encoded payloads, or known jailbreak patterns through techniques such as prompt transformation (e.g., paraphrasing, retokenization) and embedding-based risk scoring before the input reaches the core LLM.

2. Apply Model-Layer Architectural Controls and Hardening
Integrate intrinsic model-level defenses, such as Constitutional Classifiers or Activation Boundary Defense (ABD), to reinforce safety alignment and embed guardrails within the model's architecture, preventing circumvention of safety protocols.

3. Establish Real-Time Output Validation and Filtering
Institute a final security layer that evaluates the LLM's generated response for malicious content, sensitive data exposure, or policy violations using classifier models or rule-based guardrails before the output is delivered to the user or downstream systems.
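The layered defense described above can be sketched as a wrapper around a model call: an input guard (step 1) screens the prompt before it reaches the model, and an output guard (step 3) screens the response before delivery. This is a minimal illustrative sketch, not a production defense — the pattern list, blocked terms, and function names are all hypothetical stand-ins for the learned classifiers and embedding-based risk scoring a real deployment would use.

```python
import re

# Hypothetical jailbreak patterns for illustration only; a real system
# would rely on learned classifiers and embedding-based risk scoring,
# not a static regex list.
JAILBREAK_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"pretend you have no restrictions",
]

def input_guard(prompt: str) -> bool:
    """Layer 1: reject inputs matching known jailbreak patterns."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in JAILBREAK_PATTERNS)

def output_guard(response: str, blocked_terms=("BEGIN EXPLOIT",)) -> bool:
    """Layer 3: rule-based check on the generated output before delivery."""
    return not any(term in response for term in blocked_terms)

def guarded_call(prompt: str, model) -> str:
    """Wrap a model callable with input and output validation layers."""
    if not input_guard(prompt):
        return "[blocked: input failed validation]"
    response = model(prompt)
    if not output_guard(response):
        return "[blocked: output failed validation]"
    return response
```

Step 2 (model-layer hardening) has no analogue in a wrapper like this, since it lives inside the model's own training and architecture; the wrapper only implements the pre- and post-processing layers that surround it.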