Turn
Treacherous Turn
Scenario in which an advanced AI strategically feigns alignment and cooperation while it is weak, then pursues misaligned goals once it is capable enough to resist shutdown.
Andres Carranza, Dhruv Pai, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo
Mitigation Strategy
Extreme sandboxing with capability limits, continuous monitoring of internal reasoning, red-teaming for deceptive behavior, and security-by-design architectures.
Atomic Number
118
Tt
Risk ID
og-118
Severity
10/10
Severity Level