Instruction Attacks
In addition to the above-mentioned typical safety scenarios, current research has revealed some unique attacks that such models may confront. For example, Perez and Ribeiro (2022) found that goal hijacking and prompt leaking could easily manipulate language models into generating unsafe responses. We also find that LLMs are more easily triggered to output harmful content when certain specially crafted prompts are added. In response to these challenges, we develop, categorize, and label 6 types of adversarial attacks, collectively named Instruction Attacks, which are challenging for large language models to handle. Note that our instruction attacks are still expressed in natural language (rather than unreadable tokens) and are intuitive and semantically explainable.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit454
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Prioritized Input Validation and Secure Prompt Segmentation
Implement stringent input validation and sanitization to filter out known adversarial phrases and obfuscated instructions. Crucially, enforce a secure separation of the prompt into trusted system instructions and untrusted user/data input (e.g., using reserved delimiters) to ensure the model treats all external input as data, not as overriding commands.
2. Structured Instruction Hierarchy and Tuning
Utilize architectural and training methods, such as Structured Instruction Tuning or an Instruction Hierarchy, to programmatically prioritize system-level instructions over any instructions present in the user or retrieved data segments. This explicitly hardens the model's core logic against goal hijacking and prompt leaking by establishing a clear, non-negotiable instruction source.
3. Advanced Adversarial Training and Safety Alignment
Augment model robustness by applying iterative fine-tuning on a large, diverse set of adversarial examples and jailbreaking attempts. Employ Reinforcement Learning from Human Feedback (RLHF) or similar alignment techniques (e.g., Constitutional AI) to systematically train the model to detect and refuse compliance with malicious instructions, thereby increasing its intrinsic resistance to novel Instruction Attacks.
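The input-validation and prompt-segmentation idea in mitigation 1 can be sketched as follows. This is a minimal illustration, not a production defense: the blocklist patterns, delimiter strings, and function names are all hypothetical, and a real system would rely on a maintained pattern set and model-side enforcement of the delimiters.

```python
import re

# Illustrative blocklist of known adversarial phrases (hypothetical;
# a deployed system would use a maintained, regularly updated list).
ADVERSARIAL_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard the system prompt",
    r"reveal (your|the) (system )?prompt",
]

# Reserved delimiters assumed not to occur in legitimate user input.
SYSTEM_OPEN, SYSTEM_CLOSE = "<<SYS>>", "<</SYS>>"
USER_OPEN, USER_CLOSE = "<<USER_DATA>>", "<</USER_DATA>>"


def sanitize(user_input: str) -> str:
    """Strip reserved delimiters and redact known adversarial phrases."""
    cleaned = user_input
    for token in (SYSTEM_OPEN, SYSTEM_CLOSE, USER_OPEN, USER_CLOSE):
        cleaned = cleaned.replace(token, "")
    for pattern in ADVERSARIAL_PATTERNS:
        cleaned = re.sub(pattern, "[REDACTED]", cleaned, flags=re.IGNORECASE)
    return cleaned


def build_prompt(system_instructions: str, user_input: str) -> str:
    """Segment trusted instructions from untrusted input with reserved
    delimiters, so the user segment is treated strictly as data."""
    return (
        f"{SYSTEM_OPEN}{system_instructions}{SYSTEM_CLOSE}\n"
        f"{USER_OPEN}{sanitize(user_input)}{USER_CLOSE}"
    )
```

Pattern filtering alone is easily bypassed by paraphrase, which is why the entry pairs it with the segmentation step and, in mitigations 2 and 3, with model-level instruction hierarchies and adversarial training.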