Jailbreaking
A jailbreaking attack attempts to break through the guardrails that are established in the model to perform restricted actions.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1290
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement Layered Input Validation and Sanitization Deploy pre-processing systems to detect and neutralize adversarial inputs, encoded payloads, or known jailbreak patterns through techniques such as prompt transformation (e.g., paraphrasing, retokenization) and embedding-based risk scoring before the input reaches the core LLM. 2. Apply Model-Layer Architectural Controls and Hardening Integrate intrinsic model-level defenses, such as Constitutional Classifiers or Activation Boundary Defense (ABD), to reinforce safety alignment and embed intrinsic guardrails within the model's architecture, preventing the circumvention of safety protocols. 3. Establish Real-Time Output Validation and Filtering Institute a final security layer to evaluate the LLM's generated response for malicious content, sensitive data exposure, or policy violations using classifier models or rule-based guardrails before the output is delivered to the user or downstream systems.