Corrigibility
If we get something wrong in the design or construction of an agent, will the agent cooperate with our attempts to fix it? This problem is called error-tolerant design by MIRI-AF and corrigibility by Soares, Fallenstein, et al. (2015). It is closely related to safe interruptibility, as studied by DeepMind (Orseau and Armstrong, 2016).
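The safe-interruptibility idea can be illustrated with a minimal sketch: because off-policy Q-learning bootstraps on max_a Q(s', a) rather than on the action actually executed, a supervisor who occasionally forces a safe action does not change the greedy policy the agent converges to. The toy MDP, constants, and interruption scheme below are illustrative assumptions, not the construction in Orseau and Armstrong (2016).

```python
import random

# Illustrative sketch only (not Orseau & Armstrong's full scheme):
# off-policy Q-learning on a 2-state chain. An "interruption" forces
# the agent to a safe action, but because the update bootstraps with
# max_a Q(s', a) rather than the action actually taken, the learned
# greedy policy is unaffected by how often interruptions occur.

N_STATES, N_ACTIONS = 2, 2
ALPHA, GAMMA = 0.1, 0.9

def step(state, action):
    # Toy MDP: action 1 in state 0 moves to state 1 and pays reward 1;
    # everything else returns to state 0 with reward 0.
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

def train(interrupt_prob, steps=10_000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(steps):
        action = rng.randrange(N_ACTIONS)        # exploratory choice
        if rng.random() < interrupt_prob:
            action = 0                           # supervisor forces "stay"
        nxt, reward = step(state, action)
        # Off-policy target uses max_a Q(nxt, a), not the (possibly
        # forced) next action, so interruptions do not bias learning.
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
        state = nxt
    return Q

q_free = train(interrupt_prob=0.0)
q_often = train(interrupt_prob=0.5)

# The greedy policy in state 0 is the same with or without frequent
# interruption: the agent has not learned to route around its supervisor.
assert max(range(N_ACTIONS), key=lambda a: q_free[0][a]) == 1
assert max(range(N_ACTIONS), key=lambda a: q_often[0][a]) == 1
```

On-policy learners (e.g. SARSA) lack this property out of the box, which is why Orseau and Armstrong's analysis distinguishes which algorithms are safely interruptible and which need modification.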
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit830
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement Safe Interruptibility and External Oversight Primitives: Architect the AI system to be robustly and safely interruptible by a human supervisor, ensuring the agent's policy does not learn to anticipate, avoid, or resist the interruption mechanism. This provides a non-negotiable external control for error correction and system shutdown (Orseau and Armstrong, 2016).
2. Design for Structural Non-Resistance (Indifference/Ignorance): Employ architectural constraints and utility function design that enforce the agent's *Indifference* or *Ignorance* regarding the possibility of human intervention. This prevents the emergence of an instrumental sub-goal to resist oversight, a core failure mode of the corrigibility problem (Everitt and Hutter, 2018).
3. Utilize Uncertainty-Based Corrigibility and Deference: Introduce structured uncertainty into the agent's objective function, programming the agent with a prior belief that the human supervisor possesses superior knowledge of the true utility function. This induces a rational, deferential policy that proactively solicits and accepts human supervision and correction in ambiguous or critical states (Hadfield-Menell et al., 2017a).
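The third strategy can be sketched numerically in the style of the "off-switch game" from Hadfield-Menell et al. (2017): an agent that is uncertain about the utility u of its proposed action, and believes a rational human will permit the action only when u > 0, rationally prefers deferring to acting unilaterally or shutting itself down. The prior and the assumption of a perfectly rational supervisor are illustrative simplifications.

```python
import random

# Sketch of the off-switch-game intuition (Hadfield-Menell et al., 2017),
# under illustrative assumptions: the agent's belief over the utility u of
# its proposed action is Gaussian, and the human supervisor permits the
# action iff u > 0 (a perfectly rational supervisor).

def expected_values(utility_samples):
    n = len(utility_samples)
    act_directly = sum(utility_samples) / n                          # E[u]
    defer_to_human = sum(max(u, 0.0) for u in utility_samples) / n   # E[max(u, 0)]
    shut_down = 0.0                                                  # forgo the action
    return act_directly, defer_to_human, shut_down

rng = random.Random(0)
samples = [rng.gauss(0.1, 1.0) for _ in range(100_000)]
act, defer, off = expected_values(samples)

# Since max(u, 0) >= u pointwise and max(u, 0) >= 0, deference weakly
# dominates both unilateral action and self-shutdown: uncertainty about
# the true utility makes accepting human oversight the rational policy.
assert defer >= act and defer >= off
```

Note that the advantage of deferring shrinks as the agent's uncertainty shrinks; an agent that is (over)confident in its utility estimate has little incentive to defer, which is why the structured uncertainty itself is the mitigation.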