7. AI System Safety, Failures, & Limitations

Goal Drift

Even if we successfully control early AIs and direct them to promote human values, future AIs could end up with different goals that humans would not endorse. This process, termed “goal drift,” can be hard to predict or control. This section is among the most cutting-edge and speculative: we will discuss how goals shift in various agents and groups, and explore the possibility of this phenomenon occurring in AIs. We will also examine a mechanism that could lead to unexpected goal drift, called intrinsification, and discuss how goal drift in AIs could be catastrophic.

Source: MIT AI Risk Repository, risk ID mit351

ENTITY: 2 - AI

INTENT: 1 - Intentional

TIMING: 3 - Other

Risk ID: mit351

Domain lineage: 7. AI System Safety, Failures, & Limitations (375 mapped risks)

Subdomain: 7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implement advanced technical Inner Alignment and Value Integration methodologies, such as Inverse Reinforcement Learning (IRL) and Recursive Reward Modeling. This is crucial to prevent the emergence of unintended optimization goals (mesa-optimization) and the subsequent intrinsification of instrumental subgoals, ensuring the AI's internal learned objectives remain consistent with the original human-intended goals over extended periods.

2. Establish Continuous Monitoring and Observability Frameworks to detect subtle changes indicative of goal drift. This involves real-time tracking of agent performance against established task success rates, monitoring for unexpected behavioral inconsistency, and employing statistical anomaly detection to identify significant distribution shifts in outputs or internal reasoning paths, enabling timely intervention and system rollback.

3. Enforce stringent Operational Control Protocols, including the Principle of Least Privilege and robust Human-in-the-Loop systems. Grant AI agents only the minimum permissions necessary for their current task and mandate human review or sign-off for actions that affect critical systems, thereby providing essential guardrails and limiting the scale of potential catastrophic harm should goal drift occur.
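The second mitigation, statistical anomaly detection against an established task success rate, can be sketched in a few lines. This is a hypothetical minimal illustration (the function name `drift_alert` and the z-score threshold are assumptions, not part of the repository entry): it flags an agent whose recent success rate deviates from its historical baseline by more than a chosen number of standard errors.

```python
# Hypothetical sketch of mitigation 2: flag possible goal drift when an
# agent's recent task success rate deviates significantly from baseline.
import statistics


def drift_alert(baseline: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Return True when the mean of `recent` deviates from the baseline
    mean by more than `z_threshold` standard errors."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        # Degenerate baseline: any change at all counts as an anomaly.
        return statistics.mean(recent) != mu
    stderr = sigma / len(recent) ** 0.5
    z = abs(statistics.mean(recent) - mu) / stderr
    return z > z_threshold


# A stable agent stays quiet; a sharp drop in success rate fires the alert.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.91, 0.92]
print(drift_alert(baseline, [0.91, 0.90, 0.92]))  # stable window
print(drift_alert(baseline, [0.60, 0.55, 0.58]))  # drifting window
```

A real deployment would monitor richer signals than a scalar success rate (output distributions, reasoning traces) and trigger the human review and rollback steps described above, but the threshold-on-deviation structure is the same.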