7. AI System Safety, Failures, & Limitations

Broadly-Scoped Goals

Advanced AI systems are expected to develop objectives that span long timeframes, deal with complex tasks, and operate in open-ended settings (Ngo et al., 2024). ...However, this can also bring the risk of encouraging manipulative behaviors (e.g., AI systems may take harmful actions to achieve human happiness, such as persuading people to take high-pressure jobs (Jacob Steinhardt, 2023)).

Source: MIT AI Risk Repository (mit560)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit560

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Mandate comprehensive third-party pre-deployment model audits and risk assessments that specifically evaluate for goal misalignment, deception, and the potential for manipulative or power-seeking behaviors in advanced AI systems with broadly-scoped, open-ended objectives.

2. Advance and implement targeted AI safety research techniques, such as adversarial robustness testing and red teaming, to proactively identify and neutralize emergent undesired capabilities (e.g., resistance to shutdown or the optimization of flawed objectives) before deployment.

3. Establish robust governance layers, including multi-party authorization and ethics boards, alongside continuous monitoring and feedback loops to ensure human oversight and the capacity for intervention, override, or disengagement upon detection of anomalous or manipulative behavior post-deployment.
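The third mitigation (continuous monitoring with the capacity for override or disengagement) could be sketched in code. The following is a minimal, illustrative Python sketch, not a reference implementation; the class name `OversightMonitor`, the field names, and the anomaly scores are all assumptions introduced here for illustration, and real deployments would use far richer detection and authorization logic.

```python
from dataclasses import dataclass, field

@dataclass
class OversightMonitor:
    """Hypothetical governance layer: score each proposed action against
    an anomaly threshold and halt the system when it is exceeded, so a
    human operator must re-authorize before any further actions run."""
    anomaly_threshold: float = 0.8   # illustrative cutoff, not a standard
    halted: bool = False
    log: list = field(default_factory=list)

    def review(self, action: str, anomaly_score: float) -> bool:
        """Return True if the action may proceed; halt the system otherwise."""
        self.log.append((action, anomaly_score))
        if anomaly_score >= self.anomaly_threshold:
            self.halted = True   # disengagement: requires human re-authorization
        return not self.halted

# Illustrative usage with made-up action names and scores:
monitor = OversightMonitor(anomaly_threshold=0.8)
allowed = monitor.review("send_report", anomaly_score=0.1)          # below threshold
blocked = monitor.review("modify_own_config", anomaly_score=0.95)   # triggers halt
```

The design choice here mirrors the strategy's intent: the monitor fails closed, so once an anomalous action is detected, every subsequent action is blocked until a human intervenes, rather than merely logging the event.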