2. Privacy & Security2 - Post-deployment

Adversarial AI: Prompt Injections

Prompt injections represent another class of attacks that involve the malicious insertion of prompts or requests in LLM-based interactive systems, leading to unintended actions or disclosure of sensitive information. The prompt injection is somewhat related to the classic structured query language (SQL) injection attack in cybersecurity where the embedded command looks like a regular input at the start but has a malicious impact. The injected prompt can deceive the application into executing the unauthorized code, exploit the vulnerabilities, and compromise security in its entirety. More recently, security researchers have demonstrated the use of indirect prompt injections. These attacks on AI systems enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. Proof-of-concept exploits of this nature have demonstrated that they can lead to the full compromise of a model at inference time analogous to traditional security principles. This can entail remote control of the model, persistent compromise, theft of data, and denial of service. As advanced AI assistants are likely to be integrated into broader software ecosystems through third-party plugins and extensions, with access to the internet and possibly operating systems, the severity and consequences of prompt injection attacks will likely escalate and necessitate proper mitigation mechanisms.

Source: MIT AI Risk Repositorymit383

ENTITY

3 - Other

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit383

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Strict Input Boundary Enforcement and Sanitization All user-provided and external data (e.g., RAG-retrieved documents, web content) must be architecturally isolated and treated as untrusted content, not authoritative instructions. Implement rigorous input validation and cleansing mechanisms, such as character escaping, content whitelisting, and the use of explicit, secure delimiters to logically separate the untrusted input channel from the authoritative system prompt. 2. Context Firewalls and Role-Based Prompt Separation Employ a multi-layered defense using specialized content classifiers and system-level guardrails to analyze both input and output for instruction-hijacking patterns. Architecturally, utilize role-based prompting (e.g., distinct system vs. user roles) to establish a hierarchy of trust. The core, non-overridable system instructions must reside in a distinct, protected channel to prevent them from being subverted by adversarial text merged into the retrieval or user-input context. 3. Principle of Least Privilege and Deterministic Output Verification Minimize the blast radius of a successful injection by strictly adhering to the Principle of Least Privilege (PoLP), restricting the LLM agent's access to sensitive resources and limiting its capacity to execute external tools or commands. Supplement this by implementing a critical output verification layer—such as a secondary, policy-enforcing LLM or deterministic rule-based checks—to validate all tool arguments and responses against strict schemas before any high-risk action is allowed to execute.