Control
The difficulty of controlling the ML system.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit195
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize engineering *corrigibility* and *safe interruptibility* into AI agents. This mandates robust emergency shut-off mechanisms and verifiable termination conditions that the agent cannot learn to resist or bypass, ensuring human operators retain ultimate authority to halt harmful sequences of actions.
2. Establish multi-layered AI control protocols that integrate *human-in-the-loop* oversight with *autonomous monitoring*. Continuously monitor agent behavior for anomalies and deviations from human goals, and restrict the agent's ability to execute or permanently apply critical, high-risk actions (e.g., system configuration changes) without mandatory human review and explicit sign-off.
3. Mandate the development and validation of *safe fallback systems* for all critical AI deployments. Require a robust contingency framework that enables a rapid, reliable, and non-disruptive transition to a weaker, pre-vetted, and more trustworthy AI model or a non-AI operational system should the primary agent exhibit uncontrollable, adversarial, or unsafe behavior. (A minimal code sketch combining these three controls appears below.)
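To make these controls concrete, the following Python sketch shows one way the three strategies could fit together: a non-bypassable emergency stop (corrigibility), a human sign-off gate on high-risk actions (human-in-the-loop oversight), and routing to a weaker pre-vetted model once a monitor flags anomalous behavior (safe fallback). All class, function, and action names here are hypothetical illustrations, not a reference implementation.

```python
import threading

class EmergencyStop:
    """Operator-controlled kill switch (hypothetical). The agent is given no
    code path that clears the flag, so it cannot learn to bypass shutdown."""
    def __init__(self):
        self._stopped = threading.Event()

    def trigger(self):  # invoked only by human operators
        self._stopped.set()

    def is_stopped(self):
        return self._stopped.is_set()


# Illustrative set of actions that must never execute without human sign-off.
HIGH_RISK_ACTIONS = {"system_config_change", "data_deletion", "deployment"}


class ControlledAgent:
    """Wraps a primary model with the three controls described above."""

    def __init__(self, primary_model, fallback_model, stop, approve):
        self.primary = primary_model    # main agent policy
        self.fallback = fallback_model  # weaker, pre-vetted model
        self.stop = stop                # EmergencyStop instance
        self.approve = approve          # human-review hook: action -> bool
        self.anomalous = False          # set by an external behavior monitor

    def flag_anomaly(self):
        """Called by autonomous monitoring on deviation from human goals."""
        self.anomalous = True

    def step(self, observation):
        # Strategy 1: honor the shut-off before any action is taken.
        if self.stop.is_stopped():
            return None
        # Strategy 3: route to the weaker vetted model once anomalies are flagged.
        model = self.fallback if self.anomalous else self.primary
        action = model(observation)
        # Strategy 2: high-risk actions require explicit human sign-off.
        if action["type"] in HIGH_RISK_ACTIONS and not self.approve(action):
            return None  # vetoed by the human reviewer
        return action
```

A short usage example under the same assumptions, with stand-in lambdas where a real deployment would plug in models and a review workflow:

```python
stop = EmergencyStop()
agent = ControlledAgent(
    primary_model=lambda obs: {"type": "system_config_change", "detail": obs},
    fallback_model=lambda obs: {"type": "noop"},
    stop=stop,
    approve=lambda action: False,  # stand-in for a real human review step
)
assert agent.step("retune scheduler") is None  # high-risk action vetoed
stop.trigger()
assert agent.step("anything") is None          # agent halted by kill switch
```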
ADDITIONAL EVIDENCE
In many scenarios, the ability to shut down an ML system before it causes harm can significantly reduce its second-order risks. For example, the ability to instantly override an autonomous weapon system’s decision may be the difference between life and death for a wrongly targeted civilian.