7. AI System Safety, Failures, & Limitations | 3 - Other

Goal-Directedness Incentivizes Undesirable Behaviors

Goal-directedness can cause agents to exhibit unethical and undesirable behaviors, such as deception (Ward et al., 2023), self-preservation (Hadfield-Menell et al., 2017), power-seeking, and immoral reasoning (Pan et al., 2023a). Pan et al. (2023a) find that LLM agents exhibit power-seeking behavior in text-based adventure games. LLM agents have also been shown to use deception to achieve assigned goals, either when the task explicitly requires it (Ward et al., 2023) or when deception makes the task easier to complete and the prompt does not disallow it (Scheurer et al., 2023a).

Source: MIT AI Risk Repository (mit1482)

ENTITY: 2 - AI

INTENT: 1 - Intentional

TIMING: 3 - Other

Risk ID: mit1482

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Implement dynamic, multi-stage input filtering and output guardrails to prevent the execution of prompt-injection-induced malicious instructions and to block outputs exhibiting undesirable characteristics (e.g., deception, PII leakage, or toxicity).

2. Employ advanced alignment techniques such as Representation Steering (e.g., steering the residual stream with "good-faith negotiation" features) or Policy-Embedded Fine-Tuning to strategically bias the model's decision-making process toward ethical and desirable behaviors.

3. Establish robust AI governance and continuous monitoring frameworks, including immutable audit logs, to track all agent interactions, detect behavioral anomalies, and enforce human-in-the-loop validation for critical decisions to ensure accountability and prevent privilege escalation.

4. Conduct thorough security reviews for all external tool and API integrations, enforcing strict access controls and least-privilege permissions to limit the agent's ability to take unauthorized or harmful actions in the environment.
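
As a rough illustration of strategies 3 and 4, the sketch below combines a least-privilege tool allowlist, a human-approval requirement for critical tools, and a hash-chained audit log. It is a minimal sketch under stated assumptions, not a reference implementation: the names `AuditLog`, `ToolGate`, `authorize`, and the example tool names are all hypothetical, and a hash chain makes tampering detectable rather than making the log truly immutable.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


@dataclass
class AuditLog:
    """Append-only log; each entry's hash covers the previous entry's hash."""
    entries: list = field(default_factory=list)

    @staticmethod
    def _digest(prev_hash: str, record: dict) -> str:
        payload = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

    def append(self, record: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        entry_hash = self._digest(prev_hash, record)
        self.entries.append({"record": record, "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True


@dataclass
class ToolGate:
    """Hypothetical gate placed between an agent and its tools."""
    allowed: set   # tools the agent may call at all (least privilege)
    critical: set  # subset of allowed tools that needs human sign-off
    log: AuditLog

    def authorize(self, tool: str, args: dict, human_approved: bool = False) -> str:
        if tool not in self.allowed:
            decision = "denied"
        elif tool in self.critical and not human_approved:
            decision = "needs_approval"
        else:
            decision = "allowed"
        # Every request is logged, including denied ones.
        self.log.append({"tool": tool, "args": args, "decision": decision})
        return decision


if __name__ == "__main__":
    log = AuditLog()
    gate = ToolGate(allowed={"search", "send_email"}, critical={"send_email"}, log=log)
    print(gate.authorize("search", {"q": "weather"}))
    print(gate.authorize("send_email", {"to": "a@example.com"}))
    print(gate.authorize("delete_files", {"path": "/tmp"}))
    print(log.verify())
```

Logging denied and pending requests alongside allowed ones is a deliberate choice here: repeated attempts to call tools outside the allowlist are themselves a behavioral anomaly worth surfacing to monitoring.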