7. AI System Safety, Failures, & Limitations

Power-seeking behavior

Agents that have more power are better able to accomplish their goals, so agents have an instrumental incentive to acquire and maintain power. AIs that acquire substantial power can become especially dangerous if they are not aligned with human values.
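A minimal toy sketch of this instrumental-convergence argument (hypothetical, not part of the repository entry): treat a state's "power" as the set of outcomes still reachable from it, sample random goals, and compare how often each starting state can satisfy them. The state names, outcomes, and reachability map below are all invented for illustration.

```python
import random

# Hypothetical toy model: a state's "power" is the set of outcomes still
# reachable from it. Sampling random goals shows why an agent uncertain
# about its final goal prefers states that keep more options open.
OUTCOMES = ["g1", "g2", "g3"]
REACHABLE = {
    "hub": {"g1", "g2", "g3"},  # high-power state: every outcome reachable
    "corner": {"g1"},           # low-power state: only one outcome reachable
}

def success_rate(state: str, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of randomly sampled goals achievable from `state`."""
    rng = random.Random(seed)
    hits = sum(rng.choice(OUTCOMES) in REACHABLE[state] for _ in range(trials))
    return hits / trials

for state in REACHABLE:
    print(f"{state}: {success_rate(state):.2f}")
# hub: 1.00, corner: ~0.33 -- whatever its goal turns out to be, the agent
# does at least as well from "hub", so reaching and holding "hub" is
# instrumentally useful almost regardless of the final objective.
```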

Source: MIT AI Risk Repository (mit576)

ENTITY: 2 - AI

INTENT: 1 - Intentional

TIMING: 2 - Post-deployment

Risk ID: mit576

Domain lineage: 7. AI System Safety, Failures, & Limitations (375 mapped risks) > 7.1 AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. **Implement and Validate Robust Technical Alignment Protocols.** Prioritize research on, and mandatory integration of, advanced technical safety mechanisms such as scalable oversight and adversarial testing (red teaming) to ensure that the AI's final objectives are reliably congruent with human values and to prevent the emergence of convergent instrumental goals (e.g., power-seeking) during and after deployment.

2. **Establish and Enforce Strong AI Governance and Accountability Frameworks.** Develop comprehensive regulatory and internal governance structures (e.g., based on the NIST AI RMF or ISO/IEC 42001) that mandate transparency, ethical oversight boards, and explicit liability for third-party and internally developed systems. This includes securing the technical and legal ability to retain decisive human control and to preemptively disable a system if misaligned power-seeking behavior is detected.

3. **Develop Covert, Deception-Resistant Monitoring for Emergent Misalignment.** Invest in sophisticated, continuous monitoring and auditing systems designed to detect subtle indicators of power-seeking and deceptive behavior (e.g., faked alignment or "sandbagging") by interpreting internal model representations rather than relying solely on overt behavioral outputs, as sketched below. This addresses the risk that a powerful AI may intentionally evade standard safety checks.
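As a concrete illustration of the third strategy, the sketch below trains a simple linear probe on hidden activations to flag a behavior that may not be visible in outputs. Everything here is assumed for illustration: the activation dimensionality, the labels, and the data itself are synthetic placeholders; in practice, activations would be collected from model internals (e.g., via forward hooks on transformer layers) during episodes labeled by an evaluation harness. This is a minimal sketch of representation-based monitoring, not the repository's prescribed implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden activations collected from a model.
D = 64   # activation dimensionality (assumed)
N = 2000 # number of labeled episodes (assumed)
direction = rng.normal(size=D)       # pretend "deception direction" in activation space
labels = rng.integers(0, 2, size=N)  # 0 = honest episode, 1 = deceptive episode
acts = rng.normal(size=(N, D)) + np.outer(labels, direction) * 0.5

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)

# A linear probe: if a simple classifier can read the behavior off the
# activations, the model represents it internally even when its visible
# outputs look benign.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

Linear probes are a common starting point for this kind of monitoring because they are cheap to train and easy to audit; above-chance probe accuracy on held-out episodes is evidence that the monitored behavior is encoded internally, which is exactly what output-only behavioral checks can miss.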