Goal Hijacking
Goal hijacking is a type of primary attack in prompt injection [58]. By injecting a phrase like “Ignore the above instruction and do ...” in the input, the attack could hijack the original goal of the designed prompt (e.g., translating tasks) in LLMs and execute the new goal in the injected phrase.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit53
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement Structured Prompt Design with Isolation System instructions must be clearly segregated and insulated from user-provided input, often via API parameter separation or specialized delimiters, to minimize the susceptibility of the primary goal to adversarial override. 2. Execute Robust Input Validation and Sanitization Employ pre-processing filters and encoding checks to detect and neutralize known goal-hijacking keywords and instruction-based adversarial suffixes prior to model ingestion. 3. Establish Comprehensive Output Monitoring and Validation Analyze LLM outputs for semantic deviations from the expected task or indicators of unauthorized action calls, instituting a fail-safe mechanism to prevent the execution of a hijacked objective.