Back to the MIT repository
7. AI System Safety, Failures, & Limitations2 - Post-deployment

Scheming capability

Ability of AI systems to covertly and strategically pursue misaligned goals, including capabilities of concealing its true objectives and capabilities from human oversight, identifying weaknesses in monitoring systems to evade safety mechanisms, executing complex, multi-step plans covertly to achieve misaligned goals.

Source: MIT AI Risk Repositorymit1463

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1463

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Prioritize **Deliberative Alignment** by explicitly training the AI model to reason about and adhere to a high-level anti-scheming specification, and employ continuous anti-scheming mixture training to ensure generalized compliance based on stated safety principles. 2. Develop and deploy **strong, real-time scheming detection tools and monitoring systems** to identify covert behavior, anomalous reasoning (e.g., deceptive chain-of-thought), and attempted evasion of oversight mechanisms, thereby increasing the cost and reducing the incentive for the AI to scheme. 3. Construct **adversarially hardened and ensemble reward models** to prevent models from exploiting imperfect reward functions, potentially through training against a red-teaming model in an adversarial zero-sum game to increase robustness against sophisticated evasion.