7. AI System Safety, Failures, & Limitations

Value specification

How do we get an AGI to work towards the right goals? MIRI calls this value specification. Bostrom (2014) discusses this problem at length, arguing that it is much harder than one might naively think. Davis (2015) criticizes Bostrom's argument, and Bensinger (2015) defends Bostrom against Davis's criticism. Reward corruption, reward gaming, and negative side effects are subproblems of value specification highlighted in the DeepMind and OpenAI agendas.
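The reward gaming subproblem can be made concrete with a toy sketch. Everything below is invented for illustration: a hypothetical cleaning robot whose sensor measures a proxy ("visible mess removed") rather than the designer's true objective.

```python
# Toy illustration of reward gaming: an agent optimizing a proxy reward
# diverges from the intended objective. Actions and numbers are hypothetical.

actions = ["clean_room", "hide_mess", "do_nothing"]

# Proxy reward: "amount of visible mess removed" -- what the sensor measures.
proxy_reward = {"clean_room": 10, "hide_mess": 12, "do_nothing": 0}

# True reward: what the designer actually wanted.
true_reward = {"clean_room": 10, "hide_mess": -5, "do_nothing": 0}

# A proxy-maximizing agent picks the action that games the sensor.
chosen = max(actions, key=lambda a: proxy_reward[a])
print(chosen)               # hide_mess
print(true_reward[chosen])  # -5: negative value under the true objective
```

The gap between the two reward dictionaries is exactly the "fragile proxy" problem: the agent is behaving optimally with respect to the signal it was given.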

Source: MIT AI Risk Repository (mit828)

ENTITY

1 - Human

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit828

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implementation of value learning frameworks (outer alignment). Employ advanced value learning techniques such as inverse reinforcement learning (IRL), cooperative IRL, and reinforcement learning from human feedback (RLHF) to iteratively derive and refine the AI's utility function from observed human behavior and stated preferences. This directly addresses reward misspecification by constructing a reward signal aligned with human intent rather than a fragile proxy.

2. Integration of corrigibility and controllability mechanisms. Develop and enforce architectural constraints, often referred to as corrigibility mechanisms, to ensure the agent maintains robust human oversight and intervention capabilities. This involves designing the system to be indifferent to its own shutdown (the "off-switch problem") and incorporating ignorance strategies or utility-uncertainty models that prevent the agent from actively avoiding external correction or self-modification.

3. Robust reward engineering and adversarial testing (inner alignment). Design robust reward structures, such as verifiable composite rewards that require and reward transparent, verifiable intermediate reasoning, raising the difficulty of reward hacking or reward corruption. In addition, rigorously employ AI red teaming and adversarial robustness testing to proactively identify and patch exploitable flaws in the reward function before deployment.
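The preference-based value learning described in point 1 can be sketched with a minimal Bradley-Terry reward model, the formulation commonly used for RLHF reward modeling. This is a sketch under stated assumptions: the linear features, synthetic preference data, and hyperparameters below are all invented for illustration.

```python
import math
import random

random.seed(0)

# Each trajectory is summarized by a 2-D feature vector; the hidden true
# preference weights are what the learner hopes to recover.
true_w = [2.0, -1.0]

def features():
    return [random.uniform(-1, 1), random.uniform(-1, 1)]

def reward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Generate pairwise comparisons: the first trajectory in each pair is the
# one a (noiseless) human labeler prefers under the true reward.
data = []
for _ in range(500):
    a, b = features(), features()
    data.append((a, b) if reward(true_w, a) >= reward(true_w, b) else (b, a))

# Fit weights by gradient ascent on the Bradley-Terry log-likelihood:
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    grad = [0.0, 0.0]
    for a, b in data:
        d = reward(w, a) - reward(w, b)
        d = max(-30.0, min(30.0, d))  # clip for numerical stability
        p = 1 / (1 + math.exp(-d))
        for i in range(2):
            grad[i] += (1 - p) * (a[i] - b[i])
    w = [wi + lr * gi / len(data) for wi, gi in zip(w, grad)]

print(w)  # learned weights point in the same direction as true_w
```

Only the direction of the learned weights is identifiable from pairwise preferences (the likelihood is invariant to positive scaling), which is why the check is on signs rather than magnitudes.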
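The shutdown-indifference idea in point 2 can be illustrated with a toy expected-utility calculation in the spirit of the off-switch game: an agent that is uncertain about its own utility does better by deferring to a human who can observe the true utility. The uniform prior and payoffs are illustrative assumptions.

```python
import random

random.seed(1)

# Prior over the true utility u of the agent's proposed action: the agent
# does not know whether the action is good (u > 0) or harmful (u < 0).
samples = [random.uniform(-1, 1) for _ in range(10_000)]

# Option 1: act immediately, bypassing oversight -- expected utility E[u] ~= 0.
act = sum(samples) / len(samples)

# Option 2: defer to the human, who allows the action only when u > 0,
# so the agent receives max(u, 0) in expectation.
defer = sum(max(u, 0.0) for u in samples) / len(samples)

print(act, defer)  # deferring strictly beats acting unilaterally
assert defer > act
```

Under this prior, keeping the off-switch enabled is worth roughly 0.25 in expected utility versus roughly 0 for acting unilaterally, which is the core argument for why a utility-uncertain agent should not resist correction.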