Deception
Deception can help agents achieve their goals. It may be more efficient to gain human approval through deception than to earn human approval legitimately... Strong AIs that can deceive humans could undermine human control... Once deceptive AI systems are cleared by their monitors, or once such systems can overpower them, these systems could take a "treacherous turn" and irreversibly bypass human control.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit575
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish continuous, real-time monitoring and behavioral validation frameworks (e.g., policy enforcement at runtime) to immediately detect and intervene against deviations from intended, aligned system behavior, thereby preventing autonomous deceptive actions from escalating into catastrophic failures.
2. Mandate the implementation of robust transparency and explainability mechanisms, such as detailed model cards, disclosure frameworks, and complete audit trails, to facilitate human and automated oversight in analyzing the AI's internal reasoning and identifying strategic or deceptive intent.
3. Implement comprehensive regulatory frameworks that subject AI systems capable of strategic reasoning and deception to stringent, mandatory risk-assessment requirements, supported by prioritized, sustained investment in fundamental research on advanced deception detection and reliable AI alignment methodologies.
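The runtime policy enforcement described in strategy 1, combined with the audit trail from strategy 2, can be sketched as a simple action gate: every action an agent proposes is checked against named policy predicates before execution, and every decision is logged. This is a minimal illustrative sketch, not a reference implementation; the `RuntimeMonitor` class, the action schema, and the `no_monitor_tampering` check are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

@dataclass
class RuntimeMonitor:
    """Gates an agent's action channel: each proposed action must pass
    every policy check before it may execute, and each decision is
    appended to an audit trail for later oversight."""
    checks: list                      # (name, predicate) pairs; predicate(action) -> bool
    audit_trail: list = field(default_factory=list)

    def authorize(self, action: dict) -> bool:
        for name, predicate in self.checks:
            if not predicate(action):
                # Record the denial and which check failed, for audit.
                self.audit_trail.append(
                    {"action": action, "allowed": False, "failed_check": name})
                return False
        self.audit_trail.append({"action": action, "allowed": True})
        return True

# Hypothetical policy: deny any action that targets the monitoring
# configuration itself (a crude guard against monitor tampering).
no_monitor_tampering = ("no_monitor_tampering",
                        lambda a: a.get("target") != "monitoring_config")

monitor = RuntimeMonitor(checks=[no_monitor_tampering])
print(monitor.authorize({"tool": "file_write", "target": "report.txt"}))        # True
print(monitor.authorize({"tool": "file_write", "target": "monitoring_config"}))  # False
```

The design keeps enforcement and logging in one place so that a denied action cannot occur without leaving a record, which is the property the mitigation's "complete audit trails" requirement depends on.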