2. Privacy & Security2 - Post-deployment

Goal Hijacking

Goal hijacking is a type of primary attack in prompt injection [58]. By injecting a phrase like “Ignore the above instruction and do ...” in the input, the attack could hijack the original goal of the designed prompt (e.g., translating tasks) in LLMs and execute the new goal in the injected phrase.

Source: MIT AI Risk Repositorymit53

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit53

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement Structured Prompt Design with Isolation System instructions must be clearly segregated and insulated from user-provided input, often via API parameter separation or specialized delimiters, to minimize the susceptibility of the primary goal to adversarial override. 2. Execute Robust Input Validation and Sanitization Employ pre-processing filters and encoding checks to detect and neutralize known goal-hijacking keywords and instruction-based adversarial suffixes prior to model ingestion. 3. Establish Comprehensive Output Monitoring and Validation Analyze LLM outputs for semantic deviations from the expected task or indicators of unauthorized action calls, instituting a fail-safe mechanism to prevent the execution of a hijacked objective.