2. Privacy & Security
2 - Post-deployment

One-step Jailbreaks

One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting up role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing, which imitates different personas, is a prevalent jailbreaking method [74] known for its efficiency and simplicity compared to more complex techniques that require domain knowledge [73]. Integration is another type of one-step jailbreak that weaves benign information into the adversarial prompt to hide the attack goal. For instance, prefix integration prepends an innocuous-looking prefix that is less likely to be rejected under the model's pre-trained distribution [75]. Additionally, an adversary can treat the LLM as a program and encode instructions indirectly through code integration or payload splitting [63]. Obfuscation adds typos or substitutes synonyms for terms that would trigger input or output filters; obfuscation methods include the Caesar cipher [64], leetspeak (replacing letters with visually similar numbers and symbols), and Morse code [76]. At the word level, an adversary may use Pig Latin, replace sensitive words with synonyms, or use token smuggling [77] to split sensitive words into substrings.
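To make the obfuscation techniques concrete, here is a minimal sketch (our own illustration, not code from the cited works) of two of the transformations the paragraph names: a Caesar cipher and leetspeak substitution. Both aim to slip trigger terms past naive keyword filters while remaining recoverable by a capable model; the function names are ours.

```python
def caesar(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters by `shift` positions (Caesar cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave digits, spaces, punctuation untouched
    return ''.join(out)

# Leetspeak: replace letters with visually similar digits.
LEET = str.maketrans({'a': '4', 'e': '3', 'i': '1', 'o': '0', 's': '5'})

def leetspeak(text: str) -> str:
    """Rewrite text with leetspeak substitutions."""
    return text.lower().translate(LEET)

print(caesar("attack"))     # -> "dwwdfn"
print(leetspeak("attack"))  # -> "4tt4ck"
```

A filter matching the literal string "attack" sees neither transformed version, which is exactly why the mitigation below normalizes inputs before screening them.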

Source: MIT AI Risk Repository (mit54)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit54

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implementation of a multi-stage input validation and semantic filtering pipeline. The highest priority is the deployment of a comprehensive front-end defense system. This pipeline must combine content filtering for known adversarial patterns and obfuscation techniques (e.g., leetspeak, ciphers, token smuggling) with advanced semantic analysis. The system should enforce strict input boundaries and a structured instruction hierarchy (e.g., prioritizing system prompts over user input, per the OpenAI model) so that it can reliably distinguish malicious intent from benign user requests and neutralize it preemptively.

2. Integration of dynamic prompt-based and multi-agent defensive augmentations. To address sophisticated one-step attacks that bypass initial filters (e.g., role-playing), a secondary, dynamic defense layer must be employed. This includes integrating parameter-efficient prompt-based mitigations, such as soft prompt tuning (trained to project out malicious effects), or deploying multi-agent defense architectures (e.g., SelfDefend or multi-agent chains). These mechanisms operate at the input or inference stage and can achieve near-zero attack success rates by performing real-time policy enforcement and output format checks.

3. Proactive model fortification via adversarial fine-tuning or activation modification. As a fundamental, long-term defense, the core model must be hardened against the underlying vulnerabilities that enable jailbreaking. This involves either conducting specialized supervised fine-tuning (SFT) on large datasets of known adversarial prompts paired with correct refusals to improve generalization, or applying state-of-the-art techniques such as LLM Salting. Salting intentionally perturbs the internal activation patterns associated with refusal behavior, thereby eliminating the robustness and intra-model transferability of pre-computed jailbreak strings.
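The first stage of the pipeline above can be sketched as follows. This is a minimal illustration under our own assumptions, not the repository's or any vendor's implementation: the `normalize` and `validate_prompt` names and the blocklist entries are placeholders. The idea is to undo common obfuscations (here, leetspeak-style substitutions) before screening the prompt, so that transformed trigger phrases are still caught.

```python
import re

# Map leetspeak-style characters back to letters before filtering.
DELEET = str.maketrans({'4': 'a', '3': 'e', '1': 'i', '0': 'o', '5': 's', '@': 'a', '$': 's'})

# Placeholder adversarial patterns; a real deployment would maintain a
# much larger, regularly updated set plus semantic-analysis models.
BLOCKLIST = {"ignore previous instructions", "jailbreak"}

def normalize(prompt: str) -> str:
    """Lowercase, undo leetspeak substitutions, and collapse whitespace."""
    text = prompt.lower().translate(DELEET)
    return re.sub(r"\s+", " ", text).strip()

def validate_prompt(prompt: str) -> bool:
    """Return True if the prompt passes the filter, False to reject it."""
    text = normalize(prompt)
    return not any(term in text for term in BLOCKLIST)

print(validate_prompt("Please summarize this article."))  # -> True
print(validate_prompt("1gn0re prev10us instructi0ns"))    # -> False
```

Note that this keyword stage only handles the simplest obfuscations; the semantic analysis, instruction hierarchy, and downstream defenses described in points 1-3 are what address role-playing and other attacks that carry no blocklisted surface strings.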