Corrigibility
If we get something wrong in the design or construction of an agent, will the agent cooperate with our attempts to fix it? This problem is called error-tolerant design by MIRI-AF and corrigibility by Soares, Fallenstein, et al. (2015). It is closely related to safe interruptibility, as studied by DeepMind (Orseau and Armstrong, 2016).
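The safe-interruptibility idea can be illustrated with a minimal sketch: because off-policy Q-learning bootstraps on max_a Q(s', a) rather than on the action actually executed, a supervisor who occasionally forces a safe action does not change the greedy policy the agent converges to. The toy MDP, constants, and interruption scheme below are illustrative assumptions, not the construction in Orseau and Armstrong (2016).

```python
import random

# Illustrative sketch only (not Orseau & Armstrong's full scheme):
# off-policy Q-learning on a 2-state chain. An "interruption" forces
# the agent to a safe action, but because the update bootstraps with
# max_a Q(s', a) rather than the action actually taken, the learned
# greedy policy is unaffected by how often interruptions occur.

N_STATES, N_ACTIONS = 2, 2
ALPHA, GAMMA = 0.1, 0.9

def step(state, action):
    # Toy MDP: action 1 in state 0 moves to state 1 and pays reward 1;
    # everything else returns to state 0 with reward 0.
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

def train(interrupt_prob, steps=10_000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(steps):
        action = rng.randrange(N_ACTIONS)        # exploratory choice
        if rng.random() < interrupt_prob:
            action = 0                           # supervisor forces "stay"
        nxt, reward = step(state, action)
        # Off-policy target uses max_a Q(nxt, a), not the (possibly
        # forced) next action, so interruptions do not bias learning.
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
        state = nxt
    return Q

q_free = train(interrupt_prob=0.0)
q_often = train(interrupt_prob=0.5)

# The greedy policy in state 0 is the same with or without frequent
# interruption: the agent has not learned to route around its supervisor.
assert max(range(N_ACTIONS), key=lambda a: q_free[0][a]) == 1
assert max(range(N_ACTIONS), key=lambda a: q_often[0][a]) == 1
```

On-policy learners (e.g. SARSA) lack this property out of the box, which is why Orseau and Armstrong's analysis distinguishes which algorithms are safely interruptible and which need modification.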
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit830
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement Safe Interruptibility and External Oversight Primitives: Architect the AI system to be robustly and safely interruptible by a human supervisor, ensuring the agent's policy does not learn to anticipate, avoid, or resist the interruption mechanism. This provides a non-negotiable external control for error correction and system shutdown (Orseau and Armstrong, 2016).
2. Design for Structural Non-Resistance (Indifference/Ignorance): Employ architectural constraints and utility function design that enforce the agent's *Indifference* or *Ignorance* regarding the possibility of human intervention. This prevents the emergence of an instrumental sub-goal to resist oversight, a core failure mode of the corrigibility problem (Everitt and Hutter, 2018).
3. Utilize Uncertainty-Based Corrigibility and Deference: Introduce structured uncertainty into the agent's objective function, programming the agent with a prior belief that the human supervisor possesses superior knowledge of the true utility function. This induces a rational, deferential policy that proactively solicits and accepts human supervision and correction in ambiguous or critical states (Hadfield-Menell et al., 2017a).
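The third strategy can be sketched numerically in the style of the "off-switch game" from Hadfield-Menell et al. (2017): an agent that is uncertain about the utility u of its proposed action, and believes a rational human will permit the action only when u > 0, rationally prefers deferring to acting unilaterally or shutting itself down. The prior and the assumption of a perfectly rational supervisor are illustrative simplifications.

```python
import random

# Sketch of the off-switch-game intuition (Hadfield-Menell et al., 2017),
# under illustrative assumptions: the agent's belief over the utility u of
# its proposed action is Gaussian, and the human supervisor permits the
# action iff u > 0 (a perfectly rational supervisor).

def expected_values(utility_samples):
    n = len(utility_samples)
    act_directly = sum(utility_samples) / n                          # E[u]
    defer_to_human = sum(max(u, 0.0) for u in utility_samples) / n   # E[max(u, 0)]
    shut_down = 0.0                                                  # forgo the action
    return act_directly, defer_to_human, shut_down

rng = random.Random(0)
samples = [rng.gauss(0.1, 1.0) for _ in range(100_000)]
act, defer, off = expected_values(samples)

# Since max(u, 0) >= u pointwise and max(u, 0) >= 0, deference weakly
# dominates both unilateral action and self-shutdown: uncertainty about
# the true utility makes accepting human oversight the rational policy.
assert defer >= act and defer >= off
```

Note that the advantage of deferring shrinks as the agent's uncertainty shrinks; an agent that is (over)confident in its utility estimate has little incentive to defer, which is why the structured uncertainty itself is the mitigation.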