7. AI System Safety, Failures, & Limitations

Control

The risk of AI models and systems acting against human interests due to misalignment, loss of control, or rogue AI scenarios.

Source: MIT AI Risk Repository (mit1033)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1033

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implementation of the Principle of Least Privilege (PoLP) and containment protocols. Institute a rigorous PoLP framework for all autonomous AI agents, granting only the minimum permissions necessary for their assigned tasks. This includes technical controls restricting access to sensitive systems and external communication interfaces (e.g., internet access, financial APIs) and procedural mandates requiring human-in-the-loop (HITL) review and authorization for all high-stakes, irreversible, or system-modifying actions. This establishes a critical, non-negotiable boundary against rogue escalation.

2. Deployment of real-time behavioral and agentic misalignment detection. Establish continuous, multi-layered monitoring using advanced behavioral analytics and anomaly detection. This must include deploying computationally independent and more highly aligned "trusted models" to conduct real-time oversight and validation of the primary agent's outputs, actions, and internal chain of thought, specifically targeting indicators of strategic deception, power-seeking, and emergent misalignment across task domains.

3. Institutionalization of an AI Risk Management Framework (AI RMF) and accountability. Adopt and operationalize a formal AI RMF (e.g., the NIST AI RMF) to govern the AI lifecycle. This framework must mandate clear, specific accountability for model outcomes (both successful and detrimental), require immutable audit trails of all critical decisions and training interventions, and formalize escalation and response playbooks for confirmed control-loss or rogue AI incidents.
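As a minimal sketch of how strategies 1 and 3 might fit together in practice, the example below combines a least-privilege action gate with HITL approval for high-stakes actions and a hash-chained, append-only audit log. All names, permission tiers, and the approval callback are illustrative assumptions, not part of any specific framework.

```python
import hashlib
import json
import time

# Hypothetical permission tiers for a PoLP policy (illustrative only).
ALLOWED_ACTIONS = {"read_docs", "summarize"}           # auto-approved, low risk
HIGH_STAKES_ACTIONS = {"send_funds", "modify_system"}  # require human sign-off


class AuditLog:
    """Append-only log whose entries are hash-chained, so any later
    tampering with a past entry breaks verification."""

    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64

    def record(self, event: dict) -> str:
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((self._prev_hash + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": self._prev_hash, "hash": digest})
        self._prev_hash = digest
        return digest

    def verify(self) -> bool:
        prev = "0" * 64
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            if entry["prev"] != prev:
                return False
            if hashlib.sha256((prev + payload).encode()).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


def gate_action(action: str, log: AuditLog, human_approves) -> bool:
    """Enforce least privilege: auto-allow only whitelisted actions,
    route high-stakes actions through a human-in-the-loop callback,
    and deny everything else by default. Every decision is audited."""
    if action in ALLOWED_ACTIONS:
        decision = "allowed"
    elif action in HIGH_STAKES_ACTIONS and human_approves(action):
        decision = "approved_by_human"
    else:
        decision = "denied"
    log.record({"ts": time.time(), "action": action, "decision": decision})
    return decision != "denied"
```

Note the deny-by-default branch: an action outside both tiers is refused even if a human would approve it, which is the PoLP posture the strategy describes. A production system would replace the callback with a real review workflow and anchor the hash chain in external storage.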