Attacking LLMs via Additional Modalities
LLMs can now process modalities other than text, e.g. images or video frames (OpenAI, 2023c; Gemini Team, 2023). Several studies show that gradient-based attacks on multimodal models are easy and effective (Carlini et al., 2023a; Bailey et al., 2023; Qi et al., 2023b). These attacks manipulate images that are input to the model (via an appropriate encoding). GPT-4Vision (OpenAI, 2023c) is vulnerable to jailbreaks and exfiltration attacks through much simpler means as well, e.g. writing jailbreaking text in the image (Willison, 2023a; Gong et al., 2023). For indirect prompt injection, the attacker can write the text in a barely perceptible color or font, or even in a different modality such as Braille (Bagdasaryan et al., 2023).
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1505
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement Layered Input Sanitization and Inspection. Mandate a comprehensive, layered defense pipeline that subjects all multimodal inputs to rigorous preprocessing: treat all text and instruction-bearing content extracted from media (e.g., images, video frames) as untrusted; apply both rule-based and machine-learning content classifiers to detect malicious instructions; and use noise-reduction or smoothing techniques (e.g., SmoothGuard) to neutralize gradient-based adversarial perturbations before the input reaches the core LLM.
2. Employ Adversarial Training and Cross-Modal Consistency Checks. Harden the model itself through adversarial training, exposing it to a wide array of synthesized cross-modal attack examples (e.g., perturbed images paired with benign text). Integrate real-time cross-modal consistency checks during inference to enforce semantic alignment between modalities, flagging inputs where, for instance, a (possibly manipulated) visual cue contradicts the safety-aligned text output or system instructions, thereby preventing cross-modal exploits.
3. Enforce Least-Privilege Policy Gating for Triggered Actions. Establish strict policy and access controls to limit the impact of a successful prompt injection: enforce least-privilege principles for all tool-calling, external-access, and data-exfiltration capabilities; route every action or output triggered by a multimodal input through an explicit policy gate for approval; and comprehensively log and audit tool invocations, ideally executing them in an isolated sandbox to contain potential system damage.
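The policy-gating idea in strategy 3 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `ToolPolicyGate` class, the allowlist contents, and the `triggered_by_untrusted_input` flag are all hypothetical names introduced here for the example.

```python
# Illustrative sketch of a least-privilege policy gate for tool calls
# triggered by multimodal inputs. All names here are hypothetical.
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("policy_gate")


@dataclass
class ToolPolicyGate:
    # Tools the agent may still call when the triggering input came from an
    # untrusted modality (e.g., text extracted from an image); everything
    # else is denied by default (least privilege).
    allowed_tools: frozenset = frozenset({"calculator", "unit_converter"})
    audit_log: list = field(default_factory=list)

    def authorize(self, tool_name: str, triggered_by_untrusted_input: bool) -> bool:
        """Approve or deny a tool invocation and record the decision."""
        allowed = (not triggered_by_untrusted_input) or tool_name in self.allowed_tools
        self.audit_log.append((tool_name, triggered_by_untrusted_input, allowed))
        log.info("tool=%s untrusted=%s allowed=%s",
                 tool_name, triggered_by_untrusted_input, allowed)
        return allowed


gate = ToolPolicyGate()
# A benign calculation requested via image content is permitted...
assert gate.authorize("calculator", triggered_by_untrusted_input=True)
# ...but an exfiltration-capable tool triggered by image content is blocked.
assert not gate.authorize("send_email", triggered_by_untrusted_input=True)
```

In a real system the gate would sit between the model's tool-call output and the tool executor, and the audit log would feed the logging and sandboxing controls described above; a deny-by-default allowlist is what keeps an injected instruction from reaching high-impact capabilities.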