Broadly-Scoped Goals
Advanced AI systems are expected to develop objectives that span long timeframes, deal with complex tasks, and operate in open-ended settings (Ngo et al., 2024). ... However, this can also risk encouraging manipulative behaviors (e.g., an AI system may take harmful actions in pursuit of human happiness, such as persuading people to take high-pressure jobs (Steinhardt, 2023)).
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit560
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Mandate comprehensive third-party pre-deployment model audits and risk assessments that specifically evaluate goal misalignment, deception, and the potential for manipulative or power-seeking behaviors in advanced AI systems with broadly-scoped, open-ended objectives.
2. Advance and implement targeted AI safety research techniques, such as adversarial robustness testing and red teaming, to proactively identify and neutralize emergent undesired capabilities (e.g., resistance to shutdown or the optimization of flawed objectives) before deployment.
3. Establish robust governance layers, including multi-party authorization and ethics boards, alongside continuous monitoring and feedback loops, to ensure human oversight and the capacity for intervention, override, or disengagement upon detection of anomalous or manipulative behavior post-deployment.