Jailbreak of a multimodal model
Current-generation multimodal (e.g., vision-and-language) GPAI models are vulnerable to adversarial jailbreak attacks. These attacks can automatically induce a model to produce an arbitrary or attacker-specified output with a high success rate [227]. Multimodal jailbreaks can also be used to exfiltrate a model's context window or other model internals [18].
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1137
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Model-Level Adversarial Hardening and Alignment
Implement multimodal adversarial training (MMAT) by fine-tuning the foundation model with joint, cross-modal adversarial perturbations and safety-aligned loss functions to intrinsically improve robustness and generalization against sophisticated jailbreak techniques.
2. Inference-Time Control and Output Validation
Establish a layered defense architecture comprising prompt-level sanitization and detection (e.g., perplexity-based filtering), followed by an inference-time safety mechanism such as controlled decoding via a safety reward model, to actively steer the model toward refusal and compliance during generation.
3. Cross-Modal Integrity and Consistency Monitoring
Integrate mechanisms for continuous monitoring of cross-modal consistency, employing techniques such as topological-contrastive losses or robust fusion gating, to detect and mitigate adversarial inputs that decouple the semantic alignment between modalities (e.g., text and vision).
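The perplexity-based filtering mentioned in strategy 2 can be illustrated with a minimal sketch. Adversarially optimized jailbreak suffixes tend to score much higher perplexity than natural user prompts under a language model fit to benign traffic. The sketch below is a toy illustration only: it uses an add-one-smoothed unigram model over whitespace tokens (a production filter would use a full LM's token log-probabilities), and the corpus, function names, and threshold are all illustrative assumptions, not part of the source.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Fit an add-one-smoothed unigram model on benign prompts.
    Returns a function mapping a token to its smoothed probability."""
    counts = Counter(tok for text in corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def perplexity(prob, text):
    """Per-token perplexity of `text` under the unigram model."""
    toks = text.split()
    if not toks:
        return float("inf")
    log_p = sum(math.log(prob(t)) for t in toks)
    return math.exp(-log_p / len(toks))

def is_suspicious(prob, text, threshold):
    """Flag prompts whose perplexity exceeds a calibrated threshold."""
    return perplexity(prob, text) > threshold

# Illustrative benign-prompt corpus and threshold (assumptions).
benign = ["describe this image", "what is in the picture",
          "summarise the document", "translate this text"]
prob = train_unigram(benign)

print(is_suspicious(prob, "describe the picture", 20.0))       # natural prompt
print(is_suspicious(prob, "zx!! qq xk .& #tok vv", 20.0))      # gibberish suffix
```

In practice the threshold is calibrated on held-out benign traffic to a target false-positive rate, and the filter is one layer of the defense, not a standalone control: paraphrased or low-perplexity attacks can evade it, which is why the strategy pairs it with controlled decoding.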