AGIs being given or developing unsafe goals
The risks associated with AGI goal safety, including human attempts to make an AGI's goals safe, as well as the AGI keeping its own goals safe during self-improvement.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit103
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Develop and implement state-of-the-art alignment techniques, such as Scalable Oversight and Cooperative Inverse Reinforcement Learning (CIRL), to ensure that the AGI's latent objectives and emergent instrumental goals remain consistent with complex, multi-objective human values (both outer and inner alignment).
2. Institute comprehensive pre-deployment safety evaluations, including Dangerous Capability Evaluations (DCEs) and red teaming, specifically designed to test for nascent power-seeking behaviors, manipulation capabilities, and the robust generalization of alignment solutions across novel and adversarial deployment scenarios.
3. Establish a robust, multi-layered control architecture that integrates both active safety mechanisms (e.g., real-time fail-safes that detect and correct adverse system behavior) and reactive safety mechanisms (e.g., externally maintained kill switches) to preserve human agency and system containment in the event of catastrophic goal misalignment.
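The layered control architecture in the third strategy can be sketched in code. This is a minimal illustrative sketch, not an implementation from the source: the class, method names, and the action-denylist check are all hypothetical stand-ins for far more sophisticated real-world mechanisms. It shows the division of labor between an active layer (every proposed action is screened before execution) and a reactive layer (an externally held kill switch that the agent cannot override).

```python
class ContainmentMonitor:
    """Hypothetical sketch of a multi-layered control architecture.

    Active layer: screen each proposed action in real time before it runs.
    Reactive layer: an operator-held kill switch that halts the system
    regardless of the agent's own state.
    """

    def __init__(self, disallowed_actions):
        # Simplified stand-in for real behavioral checks: a plain denylist.
        self.disallowed = set(disallowed_actions)
        self.killed = False  # reactive-layer state, set only by the operator

    def kill_switch(self):
        """Reactive mechanism: triggered externally, outside the agent's control."""
        self.killed = True

    def approve(self, action):
        """Active mechanism: fail-safe check applied to every proposed action."""
        if self.killed:
            return False  # containment: nothing executes after shutdown
        return action not in self.disallowed


def run_agent(monitor, proposed_actions):
    """Execute only the actions the monitor approves; skip the rest."""
    executed = []
    for action in proposed_actions:
        if monitor.approve(action):
            executed.append(action)
    return executed
```

Usage: with `ContainmentMonitor({"acquire_resources"})`, the agent's benign actions pass the active check while the disallowed one is filtered out; after `kill_switch()` is called, no further action is approved. The key design point the sketch illustrates is that the reactive layer is checked first, so a shutdown decision dominates any in-flight approval logic.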