Back to the MIT repository
2. Privacy & Security3 - Other

Prompt Attacks

carefully controlled adversarial perturbation can flip a GPT model’s answer when used to classify text inputs. Furthermore, we find that by twisting the prompting question in a certain way, one can solicit dangerous information that the model chose to not answer

Source: MIT AI Risk Repositorymit506

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit506

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Strict Input Sanitization and Boundary Isolation Treat all untrusted inputs (e.g., user queries, external data) as inert data, not executable commands. This necessitates rigorous filtering, escaping, and normalization of special characters within the input stream. Crucially, implement channel isolation by utilizing unambiguous delimiters (e.g., specific tokens or XML tags) to clearly encapsulate user-provided content, structurally preventing the Large Language Model (LLM) from interpreting the input as system-level instructions. 2. Structured Prompt Architecture and Instruction Hierarchy Design the application's core instructions into a hardened, role-separated system prompt (System Message). This prompt must explicitly contain a security directive that mandates the model ignore any contradictory or overriding instructions found within the delimited user input. Advanced techniques, such as instruction hierarchy fine-tuning or the "Sandwich Defense," should be applied to probabilistically reinforce the priority of the original system instructions over injected prompts. 3. Real-Time Output Validation and Anomaly Detection Deploy automated systems to continuously monitor and validate the LLM's generated output for anomalies. This involves checking responses against a defined security policy to detect and block behaviors such as the unauthorized disclosure of system prompt contents, the generation of malicious code (e.g., HTML/Markdown for data exfiltration), or deviations that violate safety rules. Consider a secondary, simpler model for rapid intent classification of the prompt prior to full LLM execution.