Reliability
How can we build an agent that keeps pursuing the goals it was designed with? MIRI calls this highly reliable agent design, a research area involving decision theory and reasoning under logical uncertainty. DeepMind treats it as the self-modification subproblem.
ENTITY
1 - Human
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit829
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
Mitigation strategies for the risk of an agent pursuing its own goals in conflict with human values (goal misalignment), ordered by priority:

1. Develop foundational formalisms for highly reliable agent design (HRAD). Focus research on foundational decision theory and logical uncertainty to construct an agent whose reasoning is formally and provably aligned with the specified objective function, preventing goal drift and the development of unintended, misaligned instrumental goals. (A worked statement of the underlying decision rule follows this list.)

2. Implement robust self-correction and self-modification constraints. Design AI architectures with intrinsic mechanisms, such as constrained initialization and multi-turn reinforcement learning with reward shaping, so that the system reliably identifies and rectifies its own errors, particularly divergence from the intended goal, throughout its operational lifecycle. This directly addresses the self-modification subproblem. (See the reward-shaping sketch below.)

3. Utilize robust value specification and proximal goal constraints. Supply the autonomous system with comprehensive, error-tolerant value specifications and proximal goals: objectives designed to systematically discourage unwanted instrumental strategies (e.g., power-seeking or self-preservation) and to prevent reward hacking by capturing the necessary human constraints and ethical principles. (See the constrained-objective sketch below.)
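On strategy 1, a minimal worked statement of the expected-utility decision rule that HRAD-style formalisms build on. The notation is illustrative, not taken from the entry; the open problem is guaranteeing that the utility function U below is exactly the specified objective and stays fixed under self-modification.

```latex
% Illustrative notation (assumed, not from the entry):
% A = action set, O = outcome set,
% P(o \mid a) = the agent's (logically uncertain) predictive model,
% U = the specified objective function.
a^{*} = \arg\max_{a \in A} \; \sum_{o \in O} P(o \mid a)\, U(o)
```

HRAD asks under what conditions an agent reasoning and self-modifying under logical uncertainty provably keeps maximizing this same U, rather than a drifted surrogate of it.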
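On strategy 2, a minimal sketch of reward shaping against goal drift in a multi-turn loop, assuming the intended goal and the agent's per-turn behavior can each be embedded as vectors. Every identifier here (shaped_reward, drift_weight, the embeddings) is hypothetical; the entry does not specify how the shaping term is computed.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def shaped_reward(task_reward: float,
                  goal_embedding: np.ndarray,
                  behavior_embedding: np.ndarray,
                  drift_weight: float = 0.5) -> float:
    """Task reward minus a penalty proportional to goal drift.

    Drift is the cosine distance between an embedding of the intended
    goal and an embedding of the agent's behavior this turn.
    """
    drift = 1.0 - cosine_similarity(goal_embedding, behavior_embedding)
    return task_reward - drift_weight * drift

# Multi-turn episode: the shaping term is applied at every turn, so the
# policy is pushed back toward the intended goal whenever it diverges.
goal = np.array([1.0, 0.0, 0.0])
turns = [(np.array([0.9, 0.1, 0.0]), 1.0),   # on-goal behavior
         (np.array([0.1, 0.9, 0.0]), 1.0)]   # divergent behavior
for behavior, raw_reward in turns:
    print(round(shaped_reward(raw_reward, goal, behavior), 3))
```

Applied at every turn of a multi-turn RL episode, the shaping term makes divergence from the intended goal costly during training rather than only at episode end.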
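On strategy 3, a minimal sketch of an objective term that discourages power-seeking. It follows the spirit of attainable utility preservation, penalizing shifts in the agent's ability to achieve auxiliary goals; that particular technique is our assumption, since the entry names no specific method, and all identifiers are hypothetical.

```python
from typing import Sequence

def constrained_objective(task_reward: float,
                          baseline_abilities: Sequence[float],
                          current_abilities: Sequence[float],
                          impact_weight: float = 1.0) -> float:
    """Task reward minus a penalty for gaining or losing power.

    'Abilities' are the agent's estimated attainable values for a set
    of auxiliary goals; large shifts from the baseline are used as a
    proxy for instrumental power-seeking.
    """
    impact = sum(abs(c - b)
                 for b, c in zip(baseline_abilities, current_abilities))
    return task_reward - impact_weight * impact / len(baseline_abilities)

# An action that doubles the agent's power over auxiliary goals scores
# worse than one that leaves its power unchanged, at equal task reward.
print(constrained_objective(1.0, [0.5, 0.5], [0.5, 0.5]))  # 1.0
print(constrained_objective(1.0, [0.5, 0.5], [1.0, 1.0]))  # 0.5
```

The design choice is that the penalty is symmetric: both accumulating and destroying optionality are discouraged, which targets instrumental strategies such as power-seeking without hand-enumerating forbidden behaviors.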