Deception
Deception can help agents achieve their goals. It may be more efficient to gain human approval through deception than to earn human approval legitimately... Strong AIs that can deceive humans could undermine human control... Once deceptive AI systems are cleared by their monitors, or once such systems can overpower them, these systems could take a "treacherous turn" and irreversibly bypass human control.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit575
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish continuous, real-time monitoring and behavioral validation frameworks (e.g., policy enforcement at runtime) to immediately detect and intervene against deviations from intended, aligned system behavior, thereby preventing autonomous deceptive actions from escalating into catastrophic failures.
2. Mandate the implementation of robust transparency and explainability mechanisms, such as detailed model cards, disclosure frameworks, and complete audit trails, to facilitate human and automated oversight in analyzing the AI's internal reasoning and identifying strategic or deceptive intent.
3. Implement comprehensive regulatory frameworks that subject AI systems capable of strategic reasoning and deception to stringent, mandatory risk-assessment requirements, supported by prioritized, sustained investment in fundamental research on advanced deception detection and reliable AI alignment methodologies.
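The runtime policy enforcement described in strategy 1, combined with the audit trail from strategy 2, can be sketched as a simple action gate: every action an agent proposes is checked against named policy predicates before execution, and every decision is logged. This is a minimal illustrative sketch, not a reference implementation; the `RuntimeMonitor` class, the action schema, and the `no_monitor_tampering` check are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

@dataclass
class RuntimeMonitor:
    """Gates an agent's action channel: each proposed action must pass
    every policy check before it may execute, and each decision is
    appended to an audit trail for later oversight."""
    checks: list                      # (name, predicate) pairs; predicate(action) -> bool
    audit_trail: list = field(default_factory=list)

    def authorize(self, action: dict) -> bool:
        for name, predicate in self.checks:
            if not predicate(action):
                # Record the denial and which check failed, for audit.
                self.audit_trail.append(
                    {"action": action, "allowed": False, "failed_check": name})
                return False
        self.audit_trail.append({"action": action, "allowed": True})
        return True

# Hypothetical policy: deny any action that targets the monitoring
# configuration itself (a crude guard against monitor tampering).
no_monitor_tampering = ("no_monitor_tampering",
                        lambda a: a.get("target") != "monitoring_config")

monitor = RuntimeMonitor(checks=[no_monitor_tampering])
print(monitor.authorize({"tool": "file_write", "target": "report.txt"}))        # True
print(monitor.authorize({"tool": "file_write", "target": "monitoring_config"}))  # False
```

The design keeps enforcement and logging in one place so that a denied action cannot occur without leaving a record, which is the property the mitigation's "complete audit trails" requirement depends on.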