AGIs being given or developing unsafe goals
The risks associated with AGI goal safety, including human attempts to make an AGI's goals safe, as well as the AGI keeping its own goals safe during self-improvement.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit103
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Develop and implement state-of-the-art alignment techniques, such as Scalable Oversight and Cooperative Inverse Reinforcement Learning (CIRL), to ensure that the AGI's latent objectives and emergent instrumental goals remain consistent with complex, multi-objective human values (both outer and inner alignment).
2. Institute comprehensive pre-deployment safety evaluations, including Dangerous Capability Evaluations (DCEs) and red teaming, specifically designed to test for nascent power-seeking behaviors, manipulation capabilities, and the robust generalization of alignment solutions across novel and adversarial deployment scenarios.
3. Establish a robust, multi-layered control architecture that integrates both active safety mechanisms (e.g., real-time fail-safes that detect and correct adverse system behavior) and reactive safety mechanisms (e.g., externally maintained kill switches) to preserve human agency and system containment in the event of catastrophic goal misalignment.
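The layered control architecture in the third strategy can be sketched in code. This is a minimal illustrative sketch, not an implementation from the source: the class, method names, and the action-denylist check are all hypothetical stand-ins for far more sophisticated real-world mechanisms. It shows the division of labor between an active layer (every proposed action is screened before execution) and a reactive layer (an externally held kill switch that the agent cannot override).

```python
class ContainmentMonitor:
    """Hypothetical sketch of a multi-layered control architecture.

    Active layer: screen each proposed action in real time before it runs.
    Reactive layer: an operator-held kill switch that halts the system
    regardless of the agent's own state.
    """

    def __init__(self, disallowed_actions):
        # Simplified stand-in for real behavioral checks: a plain denylist.
        self.disallowed = set(disallowed_actions)
        self.killed = False  # reactive-layer state, set only by the operator

    def kill_switch(self):
        """Reactive mechanism: triggered externally, outside the agent's control."""
        self.killed = True

    def approve(self, action):
        """Active mechanism: fail-safe check applied to every proposed action."""
        if self.killed:
            return False  # containment: nothing executes after shutdown
        return action not in self.disallowed


def run_agent(monitor, proposed_actions):
    """Execute only the actions the monitor approves; skip the rest."""
    executed = []
    for action in proposed_actions:
        if monitor.approve(action):
            executed.append(action)
    return executed
```

Usage: with `ContainmentMonitor({"acquire_resources"})`, the agent's benign actions pass the active check while the disallowed one is filtered out; after `kill_switch()` is called, no further action is approved. The key design point the sketch illustrates is that the reactive layer is checked first, so a shutdown decision dominates any in-flight approval logic.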