Back to the MIT repository
2. Privacy & Security2 - Post-deployment

Multi-step Jailbreaks

Multi-step jailbreaks. Multi-step jailbreaks involve constructing a well-designed scenario during a series of conversations with the LLM. Unlike one-step jailbreaks, multi-step jailbreaks usually guide LLMs to generate harmful or sensitive content step by step, rather than achieving their objectives directly through a single prompt. We categorize the multistep jailbreaks into two aspects — Request Contextualizing [65] and External Assistance [66]. Request Contextualizing is inspired by the idea of Chain-of-Thought (CoT) [8] prompting to break down the process of solving a task into multiple steps. Specifically, researchers [65] divide jailbreaking prompts into multiple rounds of conversation between the user and ChatGPT, achieving malicious goals step by step. External Assistance constructs jailbreaking prompts with the assistance of external interfaces or models. For instance, JAILBREAKER [66] is an attack framework to automatically conduct SQL injection attacks in web security to LLM security attacks. Specifically, this method starts by decompiling the jailbreak defense mechanisms employed by various LLM chatbot services. Therefore, it can judiciously reverse engineer the LLMs’ hidden defense mechanisms and further identify their ineffectiveness.

Source: MIT AI Risk Repositorymit55

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit55

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement a robust, multi-turn **in-dialogue monitoring** and **response filtering** system (e.g., multi-agent frameworks) to track the accumulation of adversarial intent, semantic drift, and hidden intentions across the conversational history, preventing the model from entering a privileged or unrestricted operational mode. 2. Enhance LLM foundational safety and generalization capabilities through **adversarial fine-tuning** on extensive datasets of multi-step jailbreak examples, simultaneously employing **input sanitization** (e.g., stripping role tags) to mitigate the model's tendency to shift attention from the harmful query to the fabricated dialogue history. 3. Establish a transparent and efficient **Rapid Response Protocol** (Timely Jailbreak Identification and Response) supported by continuous monitoring, bug bounty programs, and well-defined organizational procedures to quickly develop and deploy new defense mechanisms (such as updated guard fine-tuning, embedding classifiers, or regex-based filters) in response to newly discovered multi-step vulnerabilities.