Jailbreak of a multimodal model
Current-generation multimodal (e.g., vision-and-language) GPAI models are vulnerable to adversarial jailbreak attacks. These attacks can automatically induce a model to produce an arbitrary or attacker-specified output with a high success rate [227]. Multimodal jailbreaks can also be used to exfiltrate a model's context window or other model internals [18].
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1137
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Model-Level Adversarial Hardening and Alignment
Implement multimodal adversarial training (MMAT) by fine-tuning the foundation model with joint, cross-modal adversarial perturbations and safety-aligned loss functions to intrinsically improve robustness and generalization against sophisticated jailbreak techniques.
2. Inference-Time Control and Output Validation
Establish a layered defense architecture comprising prompt-level sanitization and detection (e.g., perplexity-based filtering), followed by an inference-time safety mechanism such as controlled decoding via a safety reward model, to actively steer the model toward refusal and compliance during generation.
3. Cross-Modal Integrity and Consistency Monitoring
Integrate mechanisms for continuous monitoring of cross-modal consistency, employing techniques such as topological-contrastive losses or robust fusion gating, to detect and mitigate adversarial inputs that decouple the semantic alignment between modalities (e.g., text and vision).
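The perplexity-based filtering mentioned in strategy 2 can be illustrated with a minimal sketch. Adversarially optimized jailbreak suffixes tend to score much higher perplexity than natural user prompts under a language model fit to benign traffic. The sketch below is a toy illustration only: it uses an add-one-smoothed unigram model over whitespace tokens (a production filter would use a full LM's token log-probabilities), and the corpus, function names, and threshold are all illustrative assumptions, not part of the source.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Fit an add-one-smoothed unigram model on benign prompts.
    Returns a function mapping a token to its smoothed probability."""
    counts = Counter(tok for text in corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def perplexity(prob, text):
    """Per-token perplexity of `text` under the unigram model."""
    toks = text.split()
    if not toks:
        return float("inf")
    log_p = sum(math.log(prob(t)) for t in toks)
    return math.exp(-log_p / len(toks))

def is_suspicious(prob, text, threshold):
    """Flag prompts whose perplexity exceeds a calibrated threshold."""
    return perplexity(prob, text) > threshold

# Illustrative benign-prompt corpus and threshold (assumptions).
benign = ["describe this image", "what is in the picture",
          "summarise the document", "translate this text"]
prob = train_unigram(benign)

print(is_suspicious(prob, "describe the picture", 20.0))       # natural prompt
print(is_suspicious(prob, "zx!! qq xk .& #tok vv", 20.0))      # gibberish suffix
```

In practice the threshold is calibrated on held-out benign traffic to a target false-positive rate, and the filter is one layer of the defense, not a standalone control: paraphrased or low-perplexity attacks can evade it, which is why the strategy pairs it with controlled decoding.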