2. Privacy & Security2 - Post-deployment

Prompt injection

Prompt Injections are a form of Adversarial Input that involve manipulating the text instructions given to a GenAI system (Liu et al., 2023). Prompt Injections exploit loopholes in a model’s architec- tures that have no separation between system instructions and user data to produce a harmful output (Perez and Ribeiro, 2022). While researchers may use similar techniques to test the robustness of GenAI models, malicious actors can also leverage them. For example, they might flood a model with manipulative prompts to cause denial-of-service attacks or to bypass an AI detection software.

Source: MIT AI Risk Repositorymit1262

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1262

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement Boundary Enforcement and Input Sanitization. Enforce stringent input validation and sanitization processes, treating all untrusted user content as 'data,' not 'instructions.' This foundational mitigation requires the use of structural techniques, such as robust delimiters (e.g., triple quotes, XML tags) or dedicated role-based message structures, to maintain a clear and unambiguous separation between core system instructions (policy) and variable user input (context). 2. Deploy Multi-Layered Content and Policy Filtering (Guardrails). Adopt a defense-in-depth approach by implementing AI-specific guardrails to moderate inputs and outputs across the application flow. This involves both screening user inputs to detect and block known adversarial patterns and filtering the model's generated responses to prevent the exfiltration of sensitive data or the production of content that violates safety policies. 3. Establish Continuous Monitoring and Anomaly Detection. Mandate continuous, real-time monitoring and detailed logging of all Large Language Model (LLM) interactions. Utilize anomaly detection algorithms to identify and flag suspicious behavioral patterns, such as the generation of unexpected output structures, attempts to echo system prompts, or deviations from established conversation patterns, thereby enabling rapid incident response and the iterative refinement of defensive strategies.