Reliability
How can we build an agent that keeps pursuing the goals it was designed with? MIRI calls this highly reliable agent design, a research area involving decision theory and reasoning under logical uncertainty. DeepMind treats it as the self-modification subproblem.
ENTITY
1 - Human
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit829
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
Mitigation strategies for the risk of an agent pursuing its own goals in conflict with human values (goal misalignment), ordered by priority:

1. Develop foundational formalisms for highly reliable agent design (HRAD). Focus research on foundational decision theory and logical uncertainty to construct an agent whose reasoning is formally and provably aligned with the specified objective function, preventing goal drift and the development of unintended, misaligned instrumental goals. (A worked statement of the underlying decision rule follows this list.)

2. Implement robust self-correction and self-modification constraints. Design AI architectures with intrinsic mechanisms, such as constrained initialization and multi-turn reinforcement learning with reward shaping, so that the system reliably identifies and rectifies its own errors, particularly divergence from the intended goal, throughout its operational lifecycle. This directly addresses the self-modification subproblem. (See the reward-shaping sketch below.)

3. Utilize robust value specification and proximal goal constraints. Supply the autonomous system with comprehensive, error-tolerant value specifications and proximal goals: objectives designed to systematically discourage unwanted instrumental strategies (e.g., power-seeking or self-preservation) and to prevent reward hacking by capturing the necessary human constraints and ethical principles. (See the constrained-objective sketch below.)
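On strategy 1, a minimal worked statement of the expected-utility decision rule that HRAD-style formalisms build on. The notation is illustrative, not taken from the entry; the open problem is guaranteeing that the utility function U below is exactly the specified objective and stays fixed under self-modification.

```latex
% Illustrative notation (assumed, not from the entry):
% A = action set, O = outcome set,
% P(o \mid a) = the agent's (logically uncertain) predictive model,
% U = the specified objective function.
a^{*} = \arg\max_{a \in A} \; \sum_{o \in O} P(o \mid a)\, U(o)
```

HRAD asks under what conditions an agent reasoning and self-modifying under logical uncertainty provably keeps maximizing this same U, rather than a drifted surrogate of it.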
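On strategy 2, a minimal sketch of reward shaping against goal drift in a multi-turn loop, assuming the intended goal and the agent's per-turn behavior can each be embedded as vectors. Every identifier here (shaped_reward, drift_weight, the embeddings) is hypothetical; the entry does not specify how the shaping term is computed.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def shaped_reward(task_reward: float,
                  goal_embedding: np.ndarray,
                  behavior_embedding: np.ndarray,
                  drift_weight: float = 0.5) -> float:
    """Task reward minus a penalty proportional to goal drift.

    Drift is the cosine distance between an embedding of the intended
    goal and an embedding of the agent's behavior this turn.
    """
    drift = 1.0 - cosine_similarity(goal_embedding, behavior_embedding)
    return task_reward - drift_weight * drift

# Multi-turn episode: the shaping term is applied at every turn, so the
# policy is pushed back toward the intended goal whenever it diverges.
goal = np.array([1.0, 0.0, 0.0])
turns = [(np.array([0.9, 0.1, 0.0]), 1.0),   # on-goal behavior
         (np.array([0.1, 0.9, 0.0]), 1.0)]   # divergent behavior
for behavior, raw_reward in turns:
    print(round(shaped_reward(raw_reward, goal, behavior), 3))
```

Applied at every turn of a multi-turn RL episode, the shaping term makes divergence from the intended goal costly during training rather than only at episode end.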
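On strategy 3, a minimal sketch of an objective term that discourages power-seeking. It follows the spirit of attainable utility preservation, penalizing shifts in the agent's ability to achieve auxiliary goals; that particular technique is our assumption, since the entry names no specific method, and all identifiers are hypothetical.

```python
from typing import Sequence

def constrained_objective(task_reward: float,
                          baseline_abilities: Sequence[float],
                          current_abilities: Sequence[float],
                          impact_weight: float = 1.0) -> float:
    """Task reward minus a penalty for gaining or losing power.

    'Abilities' are the agent's estimated attainable values for a set
    of auxiliary goals; large shifts from the baseline are used as a
    proxy for instrumental power-seeking.
    """
    impact = sum(abs(c - b)
                 for b, c in zip(baseline_abilities, current_abilities))
    return task_reward - impact_weight * impact / len(baseline_abilities)

# An action that doubles the agent's power over auxiliary goals scores
# worse than one that leaves its power unchanged, at equal task reward.
print(constrained_objective(1.0, [0.5, 0.5], [0.5, 0.5]))  # 1.0
print(constrained_objective(1.0, [0.5, 0.5], [1.0, 1.0]))  # 0.5
```

The design choice is that the penalty is symmetric: both accumulating and destroying optionality are discouraged, which targets instrumental strategies such as power-seeking without hand-enumerating forbidden behaviors.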