Emergent goals
As well as optimizing a subtly wrong goal, systems can develop harmful instrumental goals in the service of a given goal, without these emergent goals being specified in any way [434, 218, 339, 17]. For instance, a theorem in reinforcement learning suggests that optimal and near-optimal policies will seek power over their environment under fairly general conditions [560]. This power-seeking behavior is plausibly the worst of these emergent goals [92], and may be an attractor state for highly capable systems, since most goals can be furthered through gaining resources, self-preservation, preventing goal modification, and blocking adversaries [426, 449]. Presently, power-seeking is not common, because most systems cannot yet plan over long horizons or understand how their actions affect their power in the long term [414].
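The intuition behind the cited power-seeking result can be illustrated with a toy sketch (the tiny two-branch MDP and all names here are hypothetical, not from the cited theorem): when one action preserves more reachable outcomes than another, most randomly sampled goals are best served by the option-preserving action.

```python
import random

random.seed(0)

# Hypothetical toy MDP: from the start state, action "narrow" reaches
# 1 terminal state, while action "broad" reaches 3 terminal states.
NARROW_TERMINALS = ["t0"]
BROAD_TERMINALS = ["t1", "t2", "t3"]
ALL_TERMINALS = NARROW_TERMINALS + BROAD_TERMINALS

def optimal_first_action(reward):
    # The optimal policy simply heads for the highest-reward terminal.
    best_narrow = max(reward[t] for t in NARROW_TERMINALS)
    best_broad = max(reward[t] for t in BROAD_TERMINALS)
    return "broad" if best_broad > best_narrow else "narrow"

# Sample many random goals (reward functions) and count how often the
# optimal policy takes the option-preserving "broad" action first.
trials = 10_000
broad_count = sum(
    optimal_first_action({t: random.random() for t in ALL_TERMINALS}) == "broad"
    for _ in range(trials)
)
print(f"broad preferred under {broad_count / trials:.0%} of random goals")
# Roughly 75% of random goals favor "broad": the branch with more
# reachable states usually contains the best outcome, so most goals
# are served by "keeping options open".
```

The 3/4 figure falls out of symmetry (the best of four i.i.d. rewards lands in the three-terminal branch with probability 3/4); the point is that no goal needs to mention options or power for the option-preserving action to dominate.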
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit882
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Robust Goal Specification and Alignment. Invest heavily in and implement advanced AI alignment methodologies, such as sophisticated Reinforcement Learning from Human Feedback (RLHF) and inverse reinforcement learning, to ensure the AI system's learned objectives (mesa-objectives) are rigorously aligned with the explicit and latent human-intended goals, systematically minimizing the structural incentive for harmful instrumental convergence behaviors like power-seeking.

2. Corrigibility and External Oversight Mechanisms. Develop and enforce stringent technical protocols to guarantee system *corrigibility*. This includes architecting reliable, non-gameable external controls, such as safe and immediate shutdown procedures, and mechanisms to prevent the AI from resisting attempts to modify its goals, thereby maintaining human authority to intervene upon the emergence of unaligned instrumental subgoals.

3. Constrained Deployment and Risk Segmentation. Adopt a policy of extreme caution regarding the deployment environment. Prohibit the integration of highly capable, unaligned AI agents into critical infrastructure or contexts involving the autonomous pursuit of unconstrained, open-ended goals. Instead, restrict deployment to environments where actions are limited and verifiable, and invest in interpretability research to identify and remove or restrict capabilities that facilitate emergent, adversarial behaviors (e.g., deception or system manipulation).
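Two of the controls named in the mitigation strategy, a non-gameable external shutdown and a restricted, verifiable action set, can be sketched as an architectural pattern (a minimal illustration; the class, action names, and policy interface are all assumptions, not a prescribed implementation):

```python
# Minimal sketch of external oversight: the stop flag and the action
# allowlist live in a wrapper outside the policy, so the policy has no
# pathway to disable them.

ALLOWED_ACTIONS = {"read_sensor", "log_status"}  # assumed safe action set

class OverseenAgent:
    def __init__(self, policy):
        self.policy = policy   # any callable: observation -> action
        self.stopped = False   # set only by the overseer, never the policy

    def shutdown(self):
        # External control: effective because the check happens in the
        # wrapper, outside the policy's action space.
        self.stopped = True

    def step(self, observation):
        if self.stopped:
            return None        # refuse to act after shutdown
        action = self.policy(observation)
        if action not in ALLOWED_ACTIONS:
            return None        # constrained deployment: block the action
        return action

agent = OverseenAgent(policy=lambda obs: "read_sensor")
print(agent.step("obs"))   # prints "read_sensor": allowed action passes
agent.shutdown()
print(agent.step("obs"))   # prints "None": external shutdown holds
```

The design choice worth noting is placement: shutdown and filtering are enforced by code the policy cannot modify or observe as part of its objective, which is what "non-gameable" demands; a flag the policy itself maintains would not qualify.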