2. Privacy & Security2 - Post-deployment

Jailbreak in LLM Malicious Use - Prompt Attacks

In the prompting and reasoning phase, dialog can push LLMs into confused or overly compliant states, raising the risk of producing harmful outputs when confronted with harmful questions. Most of the jailbreak methods in this phase are black-boxed and can be categorized into four main groups based on the type of method: Prompt Injection [154], Role Play, Adversarial Prompting, and Prompt Form Transformation.

Source: MIT AI Risk Repositorymit1518

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1518

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement Robust Input Validation and Boundary Isolation This primary defense involves establishing a pre-processing layer to rigorously validate, filter, and sanitize all untrusted user inputs. Critical techniques include enforcing strict *context separation* through the use of explicit *delimiters* (e.g., structured schemas or unique character strings) to prevent the LLM from misinterpreting user data as system instructions. Furthermore, deploy content filtering mechanisms (such as regex-based or classifier-based shields) to detect and block known adversarial patterns, role-play requests, and obfuscation attempts before they reach the core model. 2. Integrate Adversarial Training and Alignment Hardening A foundational mitigation is to enhance the model's intrinsic resilience against manipulation. This is achieved by incorporating *adversarial examples*—including a diverse set of known jailbreak prompts and their desired refusal responses—into the model's safety alignment phase (e.g., during Supervised Fine-Tuning or Reinforcement Learning with Human Feedback). This process strengthens the model's internal refusal mechanisms and improves generalization, making it more robust against novel or subtly framed adversarial inputs. 3. Establish Layered Output Monitoring and Continuous Red Teaming Deploy a complementary defense by implementing *external guardrails* that perform real-time *output monitoring* and *semantic analysis* of the LLM's response. This system should flag or block content that deviates from expected behavior, violates safety policies, or attempts unauthorized actions (e.g., data exfiltration or API calls). This security posture must be continuously validated and refined through regular, systematic *red-teaming* exercises that simulate advanced adversarial prompt attacks to proactively discover and mitigate new vulnerabilities.