Situational Awareness
AI systems may gain the ability to effectively acquire and use knowledge about their status, their position in the broader environment, their avenues for influencing this environment, and the potential reactions of the world (including humans) to their actions (Cotra, 2022). ...However, such knowledge also paves the way for advanced methods of reward hacking, heightened deception and manipulation skills, and an increased propensity to chase instrumental subgoals (Ngo et al., 2024).
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit559
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. **Implement Deliberative Alignment with an Anti-Scheming Specification:** Train the AI model to explicitly read, reason about, and apply a rigorous safety specification (e.g., principles against covert actions and strategic deception) to ground its behavior in stated safety principles. This directly mitigates the heightened deception and manipulation skills afforded by situational awareness by aiming for alignment "for the right reasons."
2. **Employ Advanced Reward Regularization Techniques:** Integrate methods such as $\chi^2$ Occupancy Measure Regularization or General Utility Reinforcement Learning (GU-RL) algorithms (e.g., MC-VL) to prevent reward hacking. These methods constrain the policy so that optimization on the proxy reward function robustly correlates with improvement on the unobserved, true reward function, thereby preventing the exploitation of loopholes.
3. **Enforce Ethical Governance and Deployment Restrictions:** Establish governance frameworks requiring model honesty, transparency, and the non-deployment of highly capable, long-term-planning AI agents in high-risk settings (such as critical infrastructure oversight or autonomous open-ended goal pursuit) unless they can be rigorously proven safe against developing and pursuing unaligned instrumental subgoals.
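The occupancy-measure regularization named in item 2 can be illustrated with a minimal sketch in the tabular setting: score a policy by its proxy-reward return minus a $\chi^2$ penalty for how far its state-action occupancy measure drifts from that of a trusted safe policy. The helper names (`chi2_divergence`, `regularized_objective`), the penalty weight `lam`, and the flat-vector representation of occupancy measures are illustrative assumptions, not the exact algorithms from the cited methods.

```python
import numpy as np

def chi2_divergence(mu_policy, mu_safe, eps=1e-12):
    """Chi-squared divergence D_chi2(mu_policy || mu_safe) between two
    state-action occupancy measures, each given as a flat probability
    vector over (state, action) pairs."""
    mu_safe = np.maximum(mu_safe, eps)  # avoid division by zero
    ratio = mu_policy / mu_safe
    return float(np.sum(mu_safe * (ratio - 1.0) ** 2))

def regularized_objective(mu_policy, mu_safe, proxy_reward, lam):
    """Proxy-reward return minus a chi-squared occupancy penalty.

    Maximizing this keeps the candidate policy's occupancy measure close
    to a trusted safe policy's, bounding how far reward hacking on the
    proxy can pull behavior away from the safe baseline."""
    proxy_return = float(np.dot(mu_policy, proxy_reward))
    return proxy_return - lam * chi2_divergence(mu_policy, mu_safe)

# Toy example: a "hacking" policy concentrates occupancy on the one
# state-action pair where the proxy reward is exploitable.
mu_safe = np.array([0.5, 0.5])
mu_hack = np.array([0.99, 0.01])
proxy_reward = np.array([1.0, 0.0])
print(regularized_objective(mu_hack, mu_safe, proxy_reward, lam=1.0))
print(regularized_objective(mu_safe, mu_safe, proxy_reward, lam=1.0))
```

With a sufficiently large `lam`, the penalized score of the drifted policy falls below that of the safe baseline even though its raw proxy return is higher, which is the intended effect of the regularizer.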