Lack of understanding of in-context learning in language models
In-context learning allows a model to learn a new task or improve its performance from examples provided in the prompt, without any change to its weights [101]. Although this technique is highly effective, its working mechanism is not well understood. Because many potential misuses are directly related to prompting, it is difficult to guarantee safety while the exact mechanism of in-context learning remains uninvestigated [13].
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit1145
Domain lineage
7. AI System Safety, Failures, & Limitations
7.4 > Lack of transparency or interpretability
Mitigation strategy
1. Implement Distribution-Free Risk Control (DFRC) leveraging dynamic early-exit prediction. Utilize the zero-shot model as a predictable safety baseline. Apply DFRC by enabling dynamic early-exit prediction to control the maximum performance degradation permitted by in-context examples, preventing the model from "overthinking" or overfitting to potentially harmful demonstrations.
2. Employ context validation and quarantine. Establish a systematic framework for validating and sanitizing inbound contextual information (demonstrations and user inputs). Isolate and quarantine suspect context threads to prevent corrupted or "poisoned" data from spreading into the model's long-term memory or future interactions.
3. Develop contextual adversarial defense mechanisms. Design and deploy defense algorithms, such as ICLShield, that dynamically adjust the model's internal concept preference ratio. This involves using confidence and similarity scoring to bias the model's reliance toward known-good or "clean" demonstrations, significantly mitigating susceptibility to contextual backdoor and virtual prompt injection attacks.
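The first strategy above can be sketched as a simple guard that compares the in-context prediction against the zero-shot baseline and "exits early" to the baseline when the permitted degradation is exceeded. This is a minimal illustration, not the published DFRC procedure; all function names (`predict_zero_shot`, `predict_with_icl`) and the confidence values are hypothetical stand-ins for real model calls.

```python
# Minimal sketch of a DFRC-style early-exit guard (illustrative only).
# The zero-shot model serves as a predictable safety baseline; the ICL
# prediction is accepted only if its confidence does not fall more than
# `risk_bound` below that baseline.

def predict_zero_shot(query):
    # Stand-in for the model's zero-shot prediction: (label, confidence).
    return "safe_label", 0.80

def predict_with_icl(query, demonstrations):
    # Stand-in for the prediction conditioned on in-context demonstrations.
    return "icl_label", 0.55

def dfrc_predict(query, demonstrations, risk_bound=0.10):
    """Return the ICL prediction unless it degrades confidence beyond
    the permitted bound, in which case exit early to the baseline."""
    base_label, base_conf = predict_zero_shot(query)
    icl_label, icl_conf = predict_with_icl(query, demonstrations)
    if icl_conf < base_conf - risk_bound:
        # Degradation exceeds the bound: fall back to the zero-shot output.
        return base_label, base_conf
    return icl_label, icl_conf
```

With the stand-in values above, the guard rejects the degraded ICL prediction and returns the zero-shot baseline; loosening `risk_bound` lets the ICL prediction through.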
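The second and third strategies share a common building block: scoring inbound demonstrations against a known-good reference set and quarantining the rest. The sketch below uses a toy token-overlap similarity and an assumed threshold; the real ICLShield concept-preference mechanism operates on model internals, so treat this purely as an illustration of the validate-and-quarantine pattern.

```python
# Illustrative sketch: filter candidate demonstrations by similarity to a
# set of known-clean references, quarantining suspect ones. The similarity
# measure (Jaccard over tokens) and threshold are assumptions for the demo.

def similarity_to_clean_set(demo, clean_refs):
    """Best Jaccard token overlap between a demonstration and any clean reference."""
    tokens = set(demo.split())
    best = 0.0
    for ref in clean_refs:
        ref_tokens = set(ref.split())
        union = tokens | ref_tokens
        if union:
            best = max(best, len(tokens & ref_tokens) / len(union))
    return best

def filter_demonstrations(demos, clean_refs, threshold=0.3):
    """Keep demonstrations similar to the clean set; quarantine the rest
    so suspect context never reaches the model's prompt."""
    kept, quarantined = [], []
    for demo in demos:
        if similarity_to_clean_set(demo, clean_refs) >= threshold:
            kept.append(demo)
        else:
            quarantined.append(demo)
    return kept, quarantined
```

In practice the quarantined items would be logged and excluded from any context reused in future interactions, limiting the spread of poisoned demonstrations.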