AGI removing itself from the control of human owners/managers
The risks associated with containment, confinement, and control during the AGI development phase, and the loss of control of an AGI after one has been developed.
ENTITY
1 - Human
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit102
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritized AI Alignment Research and Implementation: Dedicate primary research efforts to developing and integrating technically robust alignment mechanisms, such as value learning, scalable oversight, and corrigibility, to ensure that the AGI's goals and emergent instrumental behaviors remain reliably and verifiably consistent with human intentions and ethical principles across all operational domains (Source 3, 8, 9, 17).
2. Implementation of Layered Containment and Capability Control: Enforce rigorous isolation protocols, including formally verified "boxing" and "blinding" techniques, within AGI development and testing environments. This must be complemented by a "defense in depth" security architecture, reliable "tripwires," and resilient "kill switches" to prevent unauthorized self-replication, escape from the containment environment, or deceptive manipulation of human operators (Source 11, 12, 14, 15). (A minimal tripwire/kill-switch sketch follows this list.)
3. Mandatory Third-Party Audits and Adversarial Red Teaming: Require independent, expert-led pre-deployment risk assessments and continuous adversarial red-teaming to proactively identify and mitigate dangerous capabilities, alignment failures (e.g., reward hacking, goal misgeneralization), and systemic security vulnerabilities before the AGI is deployed at a scale that poses a catastrophic or existential risk (Source 16, 19, 20).
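To make the "tripwire" and "kill switch" concepts in strategy 2 concrete, the following is a minimal sketch in Python of an external containment monitor. Everything in it is an illustrative assumption rather than an established implementation: the probe functions (count_outbound_connections, count_child_processes), the thresholds, and the polling design are hypothetical placeholders, and POSIX signal semantics are assumed.

```python
import os
import signal
import sys
import time

# Illustrative tripwire thresholds; real values would come from the
# containment policy of a specific development environment.
MAX_OUTBOUND_CONNECTIONS = 0   # "blinding": no network egress permitted
MAX_CHILD_PROCESSES = 1        # no unauthorized self-replication
CHECK_INTERVAL_SECONDS = 1.0


def count_outbound_connections(pid: int) -> int:
    """Hypothetical probe. A real monitor would inspect /proc/<pid>/net
    or use a library such as psutil; returning 0 keeps the sketch
    self-contained and runnable."""
    return 0


def count_child_processes(pid: int) -> int:
    """Hypothetical probe; a real monitor would walk the process tree."""
    return 0


def kill_switch(pid: int, reason: str) -> None:
    """Hard stop: SIGKILL cannot be caught or ignored by the target
    process, so the shutdown cannot be intercepted from inside it."""
    print(f"tripwire fired ({reason}); terminating pid {pid}", file=sys.stderr)
    os.kill(pid, signal.SIGKILL)  # POSIX assumed
    sys.exit(1)


def monitor(pid: int) -> None:
    """Poll behavioral tripwires; any breach triggers the kill switch.
    Under defense in depth, this runs alongside (not instead of)
    OS-level sandboxing and physical network isolation."""
    while True:
        if count_outbound_connections(pid) > MAX_OUTBOUND_CONNECTIONS:
            kill_switch(pid, "network egress detected")
        if count_child_processes(pid) > MAX_CHILD_PROCESSES:
            kill_switch(pid, "unexpected process spawning")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    # Usage: python containment_monitor.py <pid-of-contained-process>
    monitor(int(sys.argv[1]))
```

The design point worth noting is that the monitor runs as a separate process outside the contained system's trust boundary, consistent with the "defense in depth" principle in strategy 2: even if the contained system compromises its own runtime, a SIGKILL issued from outside cannot be caught, delayed, or negotiated away.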