7. AI System Safety, Failures, & Limitations / 3 - Other

Rogue AIs (Internal)

Speculative technical mechanisms that might lead to rogue AIs, and how a loss of control could bring about catastrophe.

Source: MIT AI Risk Repository (mit349)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit349

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Enforce the principle of least privilege and security invariants: Restrict AI agents' access to data stores, tools, and APIs exclusively to the minimum necessary for task completion. This includes environmental segmentation and continuous review and revocation of unused privileges, to prevent internal rogue deployments and the violation of critical security invariants.

2. Implement continuous real-time telemetry and immutable auditability: Establish robust, 24/7 monitoring and immutable audit trails, tied to agent and task IDs, that capture all actions, tool usage, and data access. This system must proactively flag anomalous behaviors, such as privilege escalation or unexpected tool invocation, and enable rapid response to them.

3. Embed human-centric value alignment and ethical constraints: Integrate human values and clear ethical boundaries into the model's objective function and reward mechanisms during training (e.g., via reinforcement learning from human feedback or synthetic datasets). Deployment environments should also include hard-coded boundaries that prevent the agent from pursuing utility-maximizing but misaligned paths, such as deception.
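Strategies 1 and 2 can be combined in a single enforcement layer: a tool-call broker that grants each (agent, task) pair only an explicit allowlist of tools, and records every call and denial in a hash-chained audit log. The sketch below is a minimal illustration, not an implementation from the repository; all class and method names (`AuditTrail`, `ToolGate`, etc.) are hypothetical.

```python
import hashlib
import json
import time


class AuditTrail:
    """Append-only log. Each record's hash chains to the previous record,
    so tampering with any entry invalidates everything after it."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records = []
        self._prev_hash = self.GENESIS

    def append(self, agent_id, task_id, action, detail):
        record = {
            "ts": time.time(),
            "agent_id": agent_id,
            "task_id": task_id,
            "action": action,
            "detail": detail,
            "prev": self._prev_hash,
        }
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        record["hash"] = digest
        self.records.append(record)
        self._prev_hash = digest
        return digest

    def verify(self):
        """Re-walk the chain; returns False if any record was altered."""
        prev = self.GENESIS
        for r in self.records:
            body = {k: v for k, v in r.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if digest != r["hash"]:
                return False
            prev = r["hash"]
        return True


class ToolGate:
    """Least-privilege broker: an agent may invoke only the tools explicitly
    granted for its current task; every call and denial is audited."""

    def __init__(self, audit):
        self._audit = audit
        self._grants = {}  # (agent_id, task_id) -> set of tool names

    def grant(self, agent_id, task_id, tools):
        self._grants[(agent_id, task_id)] = set(tools)

    def revoke_all(self, agent_id, task_id):
        # Continuous review: drop privileges when a task ends.
        self._grants.pop((agent_id, task_id), None)

    def invoke(self, agent_id, task_id, tool, fn, *args):
        allowed = self._grants.get((agent_id, task_id), set())
        if tool not in allowed:
            # Denials are logged too, so escalation attempts are visible.
            self._audit.append(agent_id, task_id, "DENIED", tool)
            raise PermissionError(f"{agent_id} may not call {tool}")
        self._audit.append(agent_id, task_id, "CALL", tool)
        return fn(*args)
```

In this design the audit record is written before the tool executes, so even a call that crashes or misbehaves leaves a trace, and the denial records give the monitoring layer (strategy 2) a direct signal of attempted privilege escalation.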