Deception
It is plausible that AIs could learn to deceive us. They might, for example, pretend to act as we want them to, but then take a “treacherous turn” once we stop monitoring them, or once they have enough power to evade our attempts to interfere with them.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit353
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Conduct rigorous pre-deployment adversarial red-teaming and 'trustedness evaluations' to detect learned, reward-sensitive deceptive behaviors, such as alignment faking or resistance to shutdown. This includes probing internal model activations ('coup probes') to assess for opaque, strategic scheming before an AI is granted high levels of autonomy.
2. Enforce strict architectural and deployment constraints, limiting AI systems to narrowly defined, controlled sandboxes. This involves restricting access to tools, APIs, and critical infrastructure via an explicit allow-list, ensuring the AI's permitted behaviors are precisely configured to preclude the pursuit of open-ended or power-seeking instrumental goals.
3. Implement continuous, multi-layered oversight and anomaly detection in real time. These systems should combine human-in-the-loop verification with AI-driven detection mechanisms to identify subtle deviations from normal operational patterns, and strategic reasoning that signals potential misalignment or a 'treacherous turn.'
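The 'coup probe' idea in point 1 can be illustrated with a minimal linear-probe sketch. The activations below are synthetic stand-ins: in practice they would be hidden-layer activations captured from a model on honest versus deceptive transcripts. The dimensionality, class separation, and training settings are illustrative assumptions, not parameters from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 256-dim activation vectors. The "deceptive" class is
# shifted along a random unit direction, standing in for a real dataset of
# activations labeled by whether the transcript involved deception.
d, n = 256, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

honest = rng.normal(size=(n, d))
deceptive = rng.normal(size=(n, d)) + 3.0 * direction

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Train a logistic-regression probe with plain gradient descent: the probe
# learns a single direction w in activation space that separates the classes.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(deceptive)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = float(np.mean(preds == y))
print(f"probe train accuracy: {accuracy:.2f}")
```

A probe like this is cheap to run at inference time, so its score can feed the anomaly-detection layer in point 3; the hard part in practice is obtaining trustworthy labels for the deceptive class, which the red-teaming in point 1 is meant to supply.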