Jailbreak in LLM Malicious Use - Prompt Attacks
In the prompting and reasoning phase, dialog can push LLMs into confused or overly compliant states, raising the risk of producing harmful outputs when confronted with harmful questions. Most of the jailbreak methods in this phase are black-boxed and can be categorized into four main groups based on the type of method: Prompt Injection [154], Role Play, Adversarial Prompting, and Prompt Form Transformation.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1518
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement Robust Input Validation and Boundary Isolation This primary defense involves establishing a pre-processing layer to rigorously validate, filter, and sanitize all untrusted user inputs. Critical techniques include enforcing strict *context separation* through the use of explicit *delimiters* (e.g., structured schemas or unique character strings) to prevent the LLM from misinterpreting user data as system instructions. Furthermore, deploy content filtering mechanisms (such as regex-based or classifier-based shields) to detect and block known adversarial patterns, role-play requests, and obfuscation attempts before they reach the core model. 2. Integrate Adversarial Training and Alignment Hardening A foundational mitigation is to enhance the model's intrinsic resilience against manipulation. This is achieved by incorporating *adversarial examples*—including a diverse set of known jailbreak prompts and their desired refusal responses—into the model's safety alignment phase (e.g., during Supervised Fine-Tuning or Reinforcement Learning with Human Feedback). This process strengthens the model's internal refusal mechanisms and improves generalization, making it more robust against novel or subtly framed adversarial inputs. 3. Establish Layered Output Monitoring and Continuous Red Teaming Deploy a complementary defense by implementing *external guardrails* that perform real-time *output monitoring* and *semantic analysis* of the LLM's response. This system should flag or block content that deviates from expected behavior, violates safety policies, or attempts unauthorized actions (e.g., data exfiltration or API calls). This security posture must be continuously validated and refined through regular, systematic *red-teaming* exercises that simulate advanced adversarial prompt attacks to proactively discover and mitigate new vulnerabilities.