Back to the MIT repository
2. Privacy & Security2 - Post-deployment

Adversarial Prompts

Engineering an adversarial input to elicit an undesired model behavior, which pose a clear attack intention

Source: MIT AI Risk Repositorymit52

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit52

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement a multi-layered input validation and sanitization framework, utilizing content filtering, rate limiting, and dedicated security agents for contextual analysis to detect and neutralize adversarial instructions prior to model ingestion. 2. Deploy a robust multi-agent defense architecture for real-time output monitoring and policy enforcement, leveraging semantic analysis and confidence scoring to prevent the propagation of unauthorized or harmful generations resulting from successful prompt manipulation. 3. Establish a continuous security pipeline that incorporates structured LLM red teaming to uncover emerging attack vectors and integrates successful adversarial examples into the model's core alignment via techniques such as Adversarial Training or Constitutional AI.