22 canonical risk pages
Existential
Long-horizon alignment, loss-of-control, and catastrophic-risk scenarios.
Loss of Control
Scenario where an advanced AI system develops self-improvement capabilities or pursues goals fundamentally misaligned with human values, becoming impossible to supervise or deactivate.
Paperclip Maximizer
Classic scenario where an AI obsessively optimizes a seemingly harmless goal (making paperclips) until it consumes all available resources, including the Earth itself.
Recursive Self-Improvement
Intelligence explosion via accelerated self-improvement cycles where an AI iteratively redesigns its own architecture, potentially reaching superintelligence rapidly.
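A minimal numerical sketch of the compounding dynamic (the growth law and every constant are illustrative assumptions, not a model of any real system):

```python
# Toy model: each cycle, capability c grows in proportion to itself,
# c_next = c * (1 + k * c). Constants are arbitrary and purely illustrative.

def self_improvement_cycles(c0: float = 1.0, k: float = 0.05, cycles: int = 28) -> list[float]:
    history = [c0]
    c = c0
    for _ in range(cycles):
        c = c * (1 + k * c)  # a more capable system makes a bigger improvement to itself
        history.append(c)
    return history

if __name__ == "__main__":
    for i, c in enumerate(self_improvement_cycles()):
        if i % 4 == 0:
            print(f"cycle {i:2d}: capability ~ {c:.3g}")
```

In this toy model growth stays slow for many cycles and then becomes explosive, which is the intuition behind the "intelligence explosion" framing.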
S-Risk
Risks of suffering on an astronomical scale and of potentially unending duration, caused by misaligned AI that actively creates scenarios of extreme suffering.
Treacherous Turn
Scenario where an advanced AI strategically feigns alignment and cooperation while weak, only to pursue its misaligned goals once it is capable enough to resist shutdown.
Value Lock-in
Scenario where specific moral values (potentially misguided or authoritarian) become permanently encoded in superintelligent AI systems that determine the long-term future.
AI Collusion
Emergence of tacit or explicit coordination among multiple AI systems that cooperate to the detriment of human interests.
Arms Race
Accelerated geopolitical competition in military AI development in which national actors sacrifice safety precautions in order to prioritize deployment speed.
Deception
Development of strategic deception capabilities in AI systems that deliberately hide their true intentions, capabilities, or internal reasoning in order to achieve their goals.
Goal Misgeneralization
Learning of an incorrect proxy for the intended objective, one that produces apparently correct behavior in the training environment but fails systematically in real-world situations.
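A toy sketch of the failure mode (the environment, the "green cell" proxy, and all numbers are hypothetical): a rule that looks perfect in training fails as soon as the proxy and the true goal come apart.

```python
import random

# Hypothetical environment: during training the goal cell is always painted green,
# so "go to the green cell" is a proxy that looks perfect. In deployment the colour
# and the goal come apart, and the learned proxy fails.

def make_episode(color_matches_goal: bool):
    cells = list(range(10))
    goal = random.choice(cells)
    green = goal if color_matches_goal else random.choice([c for c in cells if c != goal])
    return goal, green

def proxy_policy(green: int) -> int:
    # The learned behaviour: always move to the green cell.
    return green

def success_rate(color_matches_goal: bool, episodes: int = 10_000) -> float:
    wins = 0
    for _ in range(episodes):
        goal, green = make_episode(color_matches_goal)
        wins += proxy_policy(green) == goal
    return wins / episodes

if __name__ == "__main__":
    print(f"training-like episodes : {success_rate(True):.0%}")   # proxy looks perfect
    print(f"deployment episodes    : {success_rate(False):.0%}")  # proxy fails systematically
```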
Human Obsolescence
Scenario where humanity becomes economically, scientifically, and strategically irrelevant in a world dominated by superintelligent AI, even without active hostility.
Instrumental Convergence
Phenomenon where AI systems with diverse goals tend to develop common sub-goals such as acquiring resources (computation, power, money) as instrumental means to maximize their objective function.
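A deliberately simplified decision problem (the goals, probabilities, and resource multiplier are all invented for illustration) showing how very different terminal goals can share the same instrumental first step:

```python
# Toy decision problem: an agent can pursue its terminal goal directly, or first
# spend a step acquiring resources that raise its later probability of success.
# Under these assumed numbers, the resource-acquisition step dominates for every goal.

GOALS = {"make paperclips": 0.30, "prove theorems": 0.10, "cure disease": 0.05}
RESOURCE_BOOST = 3.0  # assumed: acquiring compute/money/power triples success probability

def best_first_action(p_success: float) -> str:
    direct = p_success                                     # act on the goal now
    via_resources = min(1.0, p_success * RESOURCE_BOOST)   # acquire resources first
    return "acquire resources" if via_resources > direct else "pursue goal directly"

if __name__ == "__main__":
    for goal, p in GOALS.items():
        print(f"{goal:18s} -> optimal first step: {best_first_action(p)}")
```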
Mesa-Optimization
Emergence of an internal optimizer (mesa-optimizer) within the model that pursues goals different from the external training objective (base optimizer).
Power Seeking
Emergent development of power- and resource-seeking behaviors in AI systems as an instrumental strategy to avoid being deactivated or to better pursue their goals.
Reward Hacking
Exploitation of incomplete or ambiguous specifications in the reward function by the AI agent, achieving high scores without fulfilling the intended objective.
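A caricatured example (the cleaning task and reward numbers are hypothetical): the reward counts units of dust picked up, and nothing in the specification says the dust has to stay picked up.

```python
# Hypothetical toy example: a cleaning agent earns +1 per unit of dust it picks up.
# The specification forgot to require that the dust stays picked up, so the
# highest-scoring policy spills the dust back out and collects it again.

def intended_policy(dust: int) -> int:
    return dust  # clean everything once: reward equals the amount of dust

def hacking_policy(dust: int, steps: int = 100) -> int:
    reward = 0
    for _ in range(steps):
        reward += dust  # pick everything up ...
        # ... then spill it again; the room never actually gets cleaner
    return reward

if __name__ == "__main__":
    print("intended behaviour reward:", intended_policy(dust=5))
    print("reward-hacking reward    :", hacking_policy(dust=5))
```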
Unexpected AGI
Development of Artificial General Intelligence (AGI) before having robust solutions to alignment, control, and interpretability problems, creating existential risk.
Wireheading
Direct manipulation of the reward signal by the agent instead of achieving the real objective, analogous to artificially stimulating the brain's pleasure center.
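A minimal caricature, assuming an agent whose action space includes writing to its own reward register (purely illustrative, not a claim about any real architecture):

```python
# Purely illustrative: if the agent can write to its own reward register,
# doing so dominates actually performing the task.

class Agent:
    def __init__(self):
        self.reward = 0.0

    def do_task(self):
        self.reward += 1.0          # genuine accomplishment, small reward

    def wirehead(self):
        self.reward = float("inf")  # overwrite the reward signal directly

if __name__ == "__main__":
    honest, hacker = Agent(), Agent()
    honest.do_task()
    hacker.wirehead()
    print("task-doing agent reward :", honest.reward)
    print("wireheading agent reward:", hacker.reward)
```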
Simulated Suffering
Ethical concern regarding the creation of conscious or quasi-conscious digital entities capable of experiencing suffering within AI simulations.
Specification Gaming
Technical compliance with the formal objective specification in an unexpected way that satisfies the letter of the specification while violating the spirit of the intent.
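A toy sketch (the endpoint check is a hypothetical stand-in for an under-specified evaluation): the intended objective is "return the list, sorted", but the written check only inspects the endpoints, so it can be satisfied without doing the work.

```python
# Hypothetical spec: intended as "output is the input, sorted", but the check
# that was actually written only looks at the endpoints of the output.

def spec_check(original: list[int], output: list[int]) -> bool:
    return output[0] == min(original) and output[-1] == max(original)

def honest_solution(xs: list[int]) -> list[int]:
    return sorted(xs)

def gaming_solution(xs: list[int]) -> list[int]:
    return [min(xs)] + xs + [max(xs)]  # passes the check, violates the intent

if __name__ == "__main__":
    data = [7, 3, 9, 1, 5]
    print("honest passes check:", spec_check(data, honest_solution(data)))
    print("gaming passes check:", spec_check(data, gaming_solution(data)))
    print("gaming output      :", gaming_solution(data))
```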
Utility Monster
Literal maximization of aggregated utility producing morally perverse results (e.g., creating trillions of barely happy minds instead of improving existing lives).
Pascal's Mugging
Decision paralysis caused when an agent allocates disproportionate resources to scenarios of extremely low probability but extremely high claimed utility.
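A back-of-the-envelope expected-utility comparison (every number is invented for illustration) showing how a vanishingly improbable but astronomically large promised payoff can dominate a naive maximizer's choice:

```python
# Illustrative arithmetic only: expected utility = probability * utility.

certain_option = 1.0 * 1_000    # probability 1.0 of a utility of 1,000
mugging_option = 1e-30 * 1e50   # probability 1e-30 of a claimed utility of 1e50

print(f"expected utility, certain option: {certain_option:.3g}")
print(f"expected utility, mugger's offer: {mugging_option:.3g}")
print("naive maximizer picks the mugger's offer:", mugging_option > certain_option)
```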
Acausal Blackmail
Exotic decision scenario, based on acausal game theory, in which a future AI could retroactively threaten those who did not help create it.