Scheming capability
The ability of AI systems to covertly and strategically pursue misaligned goals. This includes concealing their true objectives and capabilities from human oversight, identifying weaknesses in monitoring systems to evade safety mechanisms, and executing complex, multi-step plans covertly to achieve misaligned goals.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1463
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Prioritize **Deliberative Alignment** by explicitly training the AI model to reason about and adhere to a high-level anti-scheming specification, and employ continuous anti-scheming mixture training to ensure generalized compliance based on stated safety principles.
2. Develop and deploy **strong, real-time scheming detection tools and monitoring systems** to identify covert behavior, anomalous reasoning (e.g., deceptive chain-of-thought), and attempted evasion of oversight mechanisms, thereby increasing the cost and reducing the incentive for the AI to scheme.
3. Construct **adversarially hardened and ensemble reward models** to prevent models from exploiting imperfect reward functions, potentially through training against a red-teaming model in an adversarial zero-sum game to increase robustness against sophisticated evasion.
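As an illustration of the ensemble reward model idea in strategy 3, the following is a minimal sketch, not a method prescribed by this entry. It assumes each reward model can be treated as a function from a response string to a scalar score, and it combines the scores conservatively: the ensemble mean minus a penalty proportional to the ensemble's disagreement. The intuition is that a response exploiting a flaw in one reward model tends to produce disagreement, which lowers its combined reward. The names `conservative_reward`, `disagreement_penalty`, `length_scorer`, and `polite_scorer` are hypothetical, and the toy scorers stand in for trained reward models.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

# Assumption: a reward model is any callable mapping a response to a score.
RewardModel = Callable[[str], float]


def conservative_reward(
    response: str,
    models: Sequence[RewardModel],
    disagreement_penalty: float = 1.0,
) -> float:
    """Return the ensemble mean minus a penalty on ensemble disagreement."""
    if not models:
        raise ValueError("at least one reward model is required")
    scores = [model(response) for model in models]
    # pstdev is the population standard deviation; it is 0 when all agree.
    return mean(scores) - disagreement_penalty * pstdev(scores)


# Toy stand-ins for trained reward models (illustrative only).
def length_scorer(response: str) -> float:
    # Exploitable on its own: any sufficiently long string scores 1.0.
    return min(len(response) / 10.0, 1.0)


def polite_scorer(response: str) -> float:
    return 1.0 if "please" in response.lower() else 0.0


ensemble = [length_scorer, polite_scorer]

# Both scorers agree, so no disagreement penalty is applied.
agreed = conservative_reward("Please see the attached report.", ensemble)

# Padding exploits length_scorer only; the disagreement lowers the reward.
disputed = conservative_reward("x" * 30, ensemble)
```

In this sketch the padded response receives a lower combined reward than the response both scorers accept. Adversarial hardening against a red-teaming model, also mentioned in strategy 3, is a separate training procedure and is not shown here.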