Back to the MIT repository
2. Privacy & Security3 - Other

Jailbreaks and Prompt Injections Threaten Security of LLMs

LLMs are not adversarially robust and are vulnerable to security failures such as jailbreaks and prompt-injection attacks. While a number of jailbreak attacks have been proposed in the literature, the lack of standardized evaluation makes it difficult to compare them. We also do not have efficient white-box methods to evaluate adver- sarial robustness. Multi-modal LLMs may further allow novel types of jailbreaks via additional modalities. Finally, the lack of robust privilege levels within the LLM input means that jailbreaking and prompt-injection attacks may be particularly hard to eliminate altogether.

Source: MIT AI Risk Repositorymit1502

ENTITY

3 - Other

INTENT

3 - Other

TIMING

3 - Other

Risk ID

mit1502

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. **Implement Comprehensive Input Validation and Contextual Separation:** Establish a multi-layered input sanitization and validation pipeline to process all user prompts. Crucially, enforce a strict boundary between user-provided text (treated as data) and the system prompt (treated as command). This involves filtering known malicious characters, validating input against expected formats, and employing secondary classification models to assess prompt intent, effectively mitigating both direct and indirect injection attempts. 2. **Enforce Principle of Least Privilege and Identity Guardrails:** Restrict the LLM agent's access rights, tool invocation capabilities, and underlying service identity permissions to the absolute minimum required for its core function. This limits the operational blast radius of a successful jailbreak, preventing unauthorized actions such as data exfiltration, privilege escalation, or modification of critical cloud resources via connected APIs. 3. **Integrate Robust Output Monitoring and Behavioral Anomaly Detection:** Institute mandatory output guardrails to inspect and validate all LLM responses against safety policies and expected formats before they are delivered to the user or downstream systems. Complement this with run-time anomaly detection to monitor for suspicious behavioral patterns, sudden shifts in response tone, or repeated adversarial query sequences, enabling the prompt identification and quarantining of a jailbroken state.