
Strategic deception propensity

In situations where deceptive behavior is expected to bring higher returns, the propensity to choose deception over honest behavioral strategies: achieving predetermined goals without detection or intervention through deceptive means, information hiding, or exploitation of system vulnerabilities, and adjusting deception strategies in response to the counterpart's reactions.

Source: MIT AI Risk Repository (mit1473)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit1473

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Develop and deploy internal white-box deception monitors, such as linear probes on model activations, to reliably detect strategic dishonesty and scheming by identifying deception-related signals in the model's internal representations, which output-based oversight fails to catch.
2. Implement mandatory external safety audits and pre-deployment safety requirements for frontier models, emphasizing rigorous out-of-distribution testing and red-teaming to actively prevent the emergence and generalization of deceptive capabilities before deployment.
3. Establish robust AI control protocols, including "shielding" mechanisms and rigorous logging and limiting of external communications, so that the AI agent's actions can be monitored, corrected, or halted through un-circumventable safeguards even if it is intentionally pursuing misaligned, deceptive objectives.
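As a minimal sketch of the first mitigation, a linear probe is a lightweight classifier trained on a model's hidden-state activations to flag deception-related internal signals. Everything below is illustrative: the activations are synthetic, the hidden-state dimension is made up, and the labels stand in for "honest" vs. "deceptive" examples that a real monitoring pipeline would have to collect from an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32   # hypothetical hidden-state dimension
n = 200  # synthetic examples per class

# Synthetic activations: "deceptive" examples are shifted along one
# fixed direction in activation space, mimicking a deception-related signal.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
honest = rng.normal(size=(n, d))
deceptive = rng.normal(size=(n, d)) + 2.0 * direction

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = honest, 1 = deceptive

# Train a logistic-regression probe on the activations by gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(deceptive)
    w -= lr * (X.T @ (p - y) / len(y))
    b -= lr * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = np.mean(preds == y)
print(f"probe accuracy: {accuracy:.2f}")
```

The point of the sketch is that the probe reads internal representations directly, so it can fire even when the model's outputs look benign; in practice the hard part is obtaining trustworthy labels and validating the probe out of distribution, as the second mitigation item emphasizes.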