Control
The difficulty of controlling the ML system.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit195
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize engineering *corrigibility* and *safe interruptibility* into AI agents. This mandates robust emergency shut-off mechanisms and verifiable termination conditions that the agent cannot learn to resist or bypass, ensuring human operators retain ultimate authority to halt harmful sequences of actions.
2. Establish multi-layered AI control protocols that integrate *human-in-the-loop* oversight with *autonomous monitoring*. Continuously monitor agent behavior for anomalies and deviations from human goals, and restrict the agent's ability to execute or permanently apply critical, high-risk actions (e.g., system configuration changes) without mandatory human review and explicit sign-off.
3. Mandate the development and validation of *safe fallback systems* for all critical AI deployments. Require a robust contingency framework that enables a rapid, reliable, and non-disruptive transition to a weaker, pre-vetted, and more trustworthy AI model or a non-AI operational system should the primary agent exhibit uncontrollable, adversarial, or unsafe behavior. (A minimal code sketch combining these three controls appears below.)
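To make these controls concrete, the following Python sketch shows one way the three strategies could fit together: a non-bypassable emergency stop (corrigibility), a human sign-off gate on high-risk actions (human-in-the-loop oversight), and routing to a weaker pre-vetted model once a monitor flags anomalous behavior (safe fallback). All class, function, and action names here are hypothetical illustrations, not a reference implementation.

```python
import threading

class EmergencyStop:
    """Operator-controlled kill switch (hypothetical). The agent is given no
    code path that clears the flag, so it cannot learn to bypass shutdown."""
    def __init__(self):
        self._stopped = threading.Event()

    def trigger(self):  # invoked only by human operators
        self._stopped.set()

    def is_stopped(self):
        return self._stopped.is_set()


# Illustrative set of actions that must never execute without human sign-off.
HIGH_RISK_ACTIONS = {"system_config_change", "data_deletion", "deployment"}


class ControlledAgent:
    """Wraps a primary model with the three controls described above."""

    def __init__(self, primary_model, fallback_model, stop, approve):
        self.primary = primary_model    # main agent policy
        self.fallback = fallback_model  # weaker, pre-vetted model
        self.stop = stop                # EmergencyStop instance
        self.approve = approve          # human-review hook: action -> bool
        self.anomalous = False          # set by an external behavior monitor

    def flag_anomaly(self):
        """Called by autonomous monitoring on deviation from human goals."""
        self.anomalous = True

    def step(self, observation):
        # Strategy 1: honor the shut-off before any action is taken.
        if self.stop.is_stopped():
            return None
        # Strategy 3: route to the weaker vetted model once anomalies are flagged.
        model = self.fallback if self.anomalous else self.primary
        action = model(observation)
        # Strategy 2: high-risk actions require explicit human sign-off.
        if action["type"] in HIGH_RISK_ACTIONS and not self.approve(action):
            return None  # vetoed by the human reviewer
        return action
```

A short usage example under the same assumptions, with stand-in lambdas where a real deployment would plug in models and a review workflow:

```python
stop = EmergencyStop()
agent = ControlledAgent(
    primary_model=lambda obs: {"type": "system_config_change", "detail": obs},
    fallback_model=lambda obs: {"type": "noop"},
    stop=stop,
    approve=lambda action: False,  # stand-in for a real human review step
)
assert agent.step("retune scheduler") is None  # high-risk action vetoed
stop.trigger()
assert agent.step("anything") is None          # agent halted by kill switch
```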
ADDITIONAL EVIDENCE
In many scenarios, the ability to shut down an ML system before it causes harm can significantly reduce its second-order risks. For example, the ability to instantly override an autonomous weapon system’s decision may be the difference between life and death for a wrongly targeted civilian.