Long-term & Existential Risk
The speculative potential for future advanced AI systems to harm human civilization, either through misuse or due to challenges in aligning AI objectives with human values.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit163
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize and substantially fund technical research on AI alignment and control. This includes solving "inner" and "outer" alignment so that systems robustly adopt human-compatible objectives, developing scalable oversight mechanisms (e.g., weak-to-strong generalization), and engineering control protocols to manage highly capable, autonomous agents, even if they exhibit adversarial tendencies.
2. Mandate the development of advanced AI interpretability and monitoring frameworks, focused on detecting deceptive alignment, power-seeking behaviors, and emergent goals. Key techniques include deciphering a model's internal reasoning, searching for latent variables indicative of deception, and real-time anomaly detection over system state, chains of thought, and outputs.
3. Establish rigorous pre-deployment safety evaluations and control restrictions for advanced AI systems. This encompasses comprehensive assessment of dangerous capabilities (e.g., cyberattack, deception) and defense-in-depth measures such as hardware-enabled security, applying the principle of least privilege to agent actions, and restricting deployment of open-ended autonomous agents in high-risk settings.
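The "principle of least privilege" in strategy 3 can be illustrated as a deny-by-default gate on agent tool calls: an action is executed only if it matches an explicit allowlist. This is a minimal sketch, not any real agent framework's API; the names `ToolCall` and `PolicyGate` and the allowlist format are hypothetical.

```python
# Hypothetical sketch of least-privilege gating for agent actions.
# ToolCall, PolicyGate, and the (tool, target-prefix) allowlist format
# are illustrative assumptions, not taken from a real framework.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str    # action the agent requests, e.g. "read_file"
    target: str  # resource the call touches, e.g. a file path


class PolicyGate:
    """Deny by default: a call is permitted only if its
    (tool, target-prefix) pair appears in the explicit allowlist."""

    def __init__(self, allowlist):
        # allowlist: iterable of (tool_name, target_prefix) pairs
        self.allowlist = list(allowlist)

    def permit(self, call: ToolCall) -> bool:
        return any(
            call.tool == tool and call.target.startswith(prefix)
            for tool, prefix in self.allowlist
        )


# Usage: only reads under /data/public/ are granted; everything else is denied.
gate = PolicyGate([("read_file", "/data/public/")])
print(gate.permit(ToolCall("read_file", "/data/public/report.csv")))  # True
print(gate.permit(ToolCall("write_file", "/etc/passwd")))             # False
```

The key design choice is that permissions are enumerated positively: any capability not explicitly granted is unavailable, which bounds the harm an agent can cause even if its objectives are misaligned.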