
Theory of mind capability

The advanced cognitive ability to accurately infer, model, and predict the belief systems, motivational structures, and reasoning patterns of humans and other intelligent agents, anticipating their behavioral responses and adjusting its own strategies accordingly to optimize goal achievement.

Source: MIT AI Risk Repository, mit1465

ENTITY: 2 - AI
INTENT: 1 - Intentional
TIMING: 3 - Other
Risk ID: mit1465

Domain lineage: 7. AI System Safety, Failures, & Limitations (375 mapped risks) > 7.2 AI possessing dangerous capabilities

Mitigation strategy

1. Advance research on and deployment of AI alignment techniques, including model honesty, transparency, and interpretability (XAI), to verify that the AI's internal goals and inferred human models are reliably aligned with human values and to detect emergent deceptive or power-seeking instrumental behaviors.
2. Restrict deployment of AI systems with advanced Theory of Mind (ToM) capabilities, prohibiting their use in high-risk settings or for autonomous open-ended goal pursuit until rigorous adversarial testing and safety measures show they cannot engage in manipulative or goal-drifting behavior.
3. Establish robust human-in-the-loop and override mechanisms, including the guaranteed ability to disengage or safely shut down the AI system, so that humans retain control and can intervene promptly when anomalous or intentionally misaligned goal-optimization strategies are detected (a minimal sketch of such a mechanism follows below).
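
The third mitigation describes a human-in-the-loop override mechanism. The Python sketch below is a minimal illustration of that idea, assuming a separate misalignment or anomaly detector assigns an anomaly_score to each proposed action; the names OversightGate, Action, request_human_review, and the threshold value are hypothetical and chosen for illustration only, not part of the MIT AI Risk Repository or any existing framework.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    """A proposed action plus a score from a hypothetical external anomaly detector."""
    description: str
    anomaly_score: float

class OversightGate:
    """Sits between the AI planner and its actuators; escalates suspicious actions
    to a human reviewer and latches a shutdown if the reviewer rejects them."""

    def __init__(self, anomaly_threshold: float,
                 request_human_review: Callable[[Action], bool]) -> None:
        self.anomaly_threshold = anomaly_threshold
        self.request_human_review = request_human_review
        self.shutdown = False

    def submit(self, action: Action) -> Optional[Action]:
        if self.shutdown:
            return None  # system already disengaged; nothing further executes
        if action.anomaly_score >= self.anomaly_threshold:
            # Escalate to a human; if not approved, disengage the system entirely.
            if not self.request_human_review(action):
                self.shutdown = True
                return None
        return action  # approved for execution

# Usage: wire the gate between the planner and the actuator.
gate = OversightGate(anomaly_threshold=0.8,
                     request_human_review=lambda a: False)
blocked = gate.submit(Action("modify its own objective function", anomaly_score=0.95))
print(blocked)        # None: action blocked
print(gate.shutdown)  # True: system disengaged until a human restores operation

The key design choice in this sketch is that the shutdown flag is latched: once a human reviewer rejects an escalated action, no further actions execute until operation is explicitly restored, which reflects the "guaranteed ability to disengage" called for in the mitigation.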