Strategic deception propensity
A propensity, in situations where deception is expected to yield higher returns, to choose deceptive over honest strategies. This includes using deceptive means, concealing information, or exploiting system vulnerabilities to achieve predetermined goals without being detected or interrupted, and adjusting the deception strategy in response to the counterpart's reactions.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1473
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Develop and deploy internal white-box deception monitors, such as linear probes on model activations, to reliably detect strategic dishonesty and scheming by identifying deception-related signals in the model's internal representations, which output-based oversight fails to catch.
2. Implement mandatory external safety audits and pre-development safety requirements for frontier models, emphasizing rigorous, out-of-distribution testing and red-teaming to actively prevent the emergence and generalization of deceptive capabilities before deployment.
3. Establish robust AI control protocols, including "shielding" mechanisms and rigorous logging/limiting of external communications, to ensure the AI agent's actions can be monitored, corrected, or halted through non-circumventable safeguards even if it is intentionally pursuing misaligned, deceptive objectives.
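As an illustration of the linear-probe idea in mitigation 1, the following is a minimal sketch, not a production monitor. It assumes that activation vectors labeled "honest" (0) or "deceptive" (1) are already available. Here they are replaced by synthetic data in which the two classes differ along one fixed direction, which is a simplifying assumption rather than a claim about real models. All names (`make_synthetic_activations`, `train_probe`, `probe_score`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for real model activations.

    Assumption: label-1 ("deceptive") samples are shifted along one
    fixed direction and label-0 ("honest") samples the opposite way.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        data.append((x, label))
    return data


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit a logistic-regression probe (weights w, bias b) with plain SGD."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b


def probe_score(w, b, x):
    """Probe output in [0, 1]; higher means more 'deceptive-like'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def accuracy(w, b, data):
    correct = sum((probe_score(w, b, x) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 8
train = make_synthetic_activations(200, DIM, rng)
held_out = make_synthetic_activations(100, DIM, rng)

w, b = train_probe(train, DIM)
acc = accuracy(w, b, held_out)
print(f"held-out probe accuracy: {acc:.2f}")
```

In practice the synthetic vectors would be replaced by activations recorded from a chosen layer of the monitored model, and the probe would need to be validated on out-of-distribution prompts, since a probe that separates classes on its training distribution may not generalize to novel deception strategies.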