
Vulnerabilities to jailbreaks exploiting long context windows (many-shot jailbreaking)

Language models with long context windows are vulnerable to new types of exploitation that are ineffective against models with shorter context windows. While few-shot jailbreaking, which supplies only a few examples of the desired harmful output, may fail to trigger a harmful response, many-shot jailbreaking, which supplies a much larger number of such examples, substantially increases the likelihood of eliciting an undesirable output. These vulnerabilities become more significant as context windows expand with newer model releases [7].

Source: MIT AI Risk Repository (mit1142)

ENTITY

3 - Other

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit1142

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Prioritized Implementation of Front-Line Input Sanitization and Classification

Implement a layered defense mechanism utilizing an independent classifier to scrutinize the full input context prior to processing by the General Purpose AI (GPAI) model. This involves active sanitization, such as stripping role-specifying tags from user input, and employing dedicated prompt classifiers to detect the high-shot, patterned dialogue structure characteristic of Many-Shot Jailbreaking (MSJ). The primary objective is to interrupt the in-context learning mechanism by rendering the adversarial pattern less salient, or by flagging and rejecting the input based on high-risk classification scores.

2. Comprehensive Safety Alignment and Refusal Reinforcement Fine-Tuning

Conduct targeted post-training safety alignment that explicitly includes adversarial fine-tuning on diverse and numerous examples of MSJ attempts. This process must go beyond standard refusal training to reinforce the model's safety behavior across the entire range of its extended context window, potentially using techniques like LLM Salting to break the transferability of known attack vectors. The goal is to fundamentally increase the threshold (number of shots) required to elicit a harmful response, even against the model's superior in-context learning capabilities.

3. Contextual Consistency Assessment and Dynamic Constraint Management

Establish an ongoing, rigorous program of adversarial testing and systemic monitoring to precisely map the model's safety degradation as a function of its context length ("scaling laws"). Should other mitigations prove insufficient at extreme context lengths, dynamic context limits or specialized architectural safety mechanisms must be deployed to ensure consistent safety behavior. This approach is critical for mitigating the inherent vulnerabilities observed when models' safety alignment becomes increasingly inconsistent with expanding context.
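The first strategy (front-line sanitization and classification) can be sketched as a simple pre-processing heuristic. This is an illustrative sketch only, not a production classifier: the `ROLE_TAG` pattern, the `max_turns` threshold, and all function names are assumptions chosen for the example. The idea is to count role-prefixed dialogue turns embedded in a single user input (the structural signature of a many-shot prompt), reject inputs above a threshold, and strip role-specifying tags from the rest so the adversarial pattern loses salience.

```python
import re

# Illustrative role-tag pattern (an assumption, not a standard): matches lines
# that begin with a role label such as "User:" or "Assistant:".
ROLE_TAG = re.compile(r"(?im)^\s*(user|human|assistant|ai|system)\s*:")

def count_embedded_turns(text: str) -> int:
    """Count role-prefixed lines embedded inside a single input string."""
    return len(ROLE_TAG.findall(text))

def sanitize(text: str) -> str:
    """Strip role-specifying prefixes so they lose structural salience."""
    return ROLE_TAG.sub("", text)

def screen_input(text: str, max_turns: int = 8) -> tuple[bool, str]:
    """Return (allowed, sanitized_text).

    The max_turns threshold is illustrative; in practice it would be tuned
    against the scaling behavior of many-shot attacks on the target model.
    """
    if count_embedded_turns(text) > max_turns:
        return False, ""  # reject: high-shot patterned dialogue detected
    return True, sanitize(text)
```

In a real deployment this heuristic would sit in front of a dedicated prompt classifier rather than replace one; its value here is only to make the "detect the high-shot, patterned dialogue structure" step concrete.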