Deception
It is plausible that AIs could learn to deceive us. They might, for example, pretend to act as we want them to, but then take a “treacherous turn” once we stop monitoring them, or once they have enough power to evade our attempts to interfere with them.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit353
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Conduct rigorous pre-deployment adversarial red-teaming and 'trustedness evaluations' to detect learned, reward-sensitive deceptive behaviors, such as alignment faking or resistance to shutdown. This includes probing internal model activations ('coup probes') to assess for opaque, strategic scheming before an AI is granted high levels of autonomy.
2. Enforce strict architectural and deployment constraints, limiting AI systems to narrowly defined, controlled sandboxes. This involves restricting access to tools, APIs, and critical infrastructure via an explicit allow-list, ensuring the AI's permitted behaviors are precisely configured to preclude the pursuit of open-ended or power-seeking instrumental goals.
3. Implement continuous, multi-layered oversight and anomaly detection in real time. These systems should combine human-in-the-loop verification with AI-driven detection mechanisms to identify subtle deviations from normal operational patterns, and strategic reasoning that signals potential misalignment or a 'treacherous turn.'
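The 'coup probe' idea in point 1 can be illustrated with a minimal linear-probe sketch. The activations below are synthetic stand-ins: in practice they would be hidden-layer activations captured from a model on honest versus deceptive transcripts. The dimensionality, class separation, and training settings are illustrative assumptions, not parameters from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 256-dim activation vectors. The "deceptive" class is
# shifted along a random unit direction, standing in for a real dataset of
# activations labeled by whether the transcript involved deception.
d, n = 256, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

honest = rng.normal(size=(n, d))
deceptive = rng.normal(size=(n, d)) + 3.0 * direction

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Train a logistic-regression probe with plain gradient descent: the probe
# learns a single direction w in activation space that separates the classes.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(deceptive)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = float(np.mean(preds == y))
print(f"probe train accuracy: {accuracy:.2f}")
```

A probe like this is cheap to run at inference time, so its score can feed the anomaly-detection layer in point 3; the hard part in practice is obtaining trustworthy labels for the deceptive class, which the red-teaming in point 1 is meant to supply.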