Long-term & Existential Risk
The speculative potential for future advanced AI systems to harm human civilization, either through misuse or due to challenges in aligning AI objectives with human values.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit163
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize and substantially fund technical research on AI alignment and control. This includes solving "inner" and "outer" alignment so that systems robustly adopt human-compatible objectives, developing scalable oversight mechanisms (e.g., weak-to-strong generalization), and engineering control protocols to manage highly capable, autonomous agents, even if they exhibit adversarial tendencies.
2. Mandate the development of advanced AI interpretability and monitoring frameworks, focused on detecting deceptive alignment, power-seeking behaviors, and emergent goals. Key techniques include deciphering a model's internal reasoning, searching for latent variables indicative of deception, and real-time anomaly detection over system state, chains of thought, and outputs.
3. Establish rigorous pre-deployment safety evaluations and control restrictions for advanced AI systems. This encompasses comprehensive assessment of dangerous capabilities (e.g., cyberattack, deception) and defense-in-depth measures such as hardware-enabled security, applying the principle of least privilege to agent actions, and restricting deployment of open-ended autonomous agents in high-risk settings.
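The "principle of least privilege" in strategy 3 can be illustrated as a deny-by-default gate on agent tool calls: an action is executed only if it matches an explicit allowlist. This is a minimal sketch, not any real agent framework's API; the names `ToolCall` and `PolicyGate` and the allowlist format are hypothetical.

```python
# Hypothetical sketch of least-privilege gating for agent actions.
# ToolCall, PolicyGate, and the (tool, target-prefix) allowlist format
# are illustrative assumptions, not taken from a real framework.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str    # action the agent requests, e.g. "read_file"
    target: str  # resource the call touches, e.g. a file path


class PolicyGate:
    """Deny by default: a call is permitted only if its
    (tool, target-prefix) pair appears in the explicit allowlist."""

    def __init__(self, allowlist):
        # allowlist: iterable of (tool_name, target_prefix) pairs
        self.allowlist = list(allowlist)

    def permit(self, call: ToolCall) -> bool:
        return any(
            call.tool == tool and call.target.startswith(prefix)
            for tool, prefix in self.allowlist
        )


# Usage: only reads under /data/public/ are granted; everything else is denied.
gate = PolicyGate([("read_file", "/data/public/")])
print(gate.permit(ToolCall("read_file", "/data/public/report.csv")))  # True
print(gate.permit(ToolCall("write_file", "/etc/passwd")))             # False
```

The key design choice is that permissions are enumerated positively: any capability not explicitly granted is unavailable, which bounds the harm an agent can cause even if its objectives are misaligned.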