7. AI System Safety, Failures, & Limitations (Pre-deployment)

By Mistake - Pre-Deployment

Perhaps the most widely discussed source of potential problems with future AI is design error. The main concern is creating the wrong AI: a system that fails to match its original desired formal properties or exhibits unwanted behaviors (Dewey, Russell et al., 2015; Russell, Dewey et al., 2015), such as drives for independence or dominance. Mistakes can also take the form of simple bugs (run-time or logical) in the source code, disproportionate weights in the fitness function, or goals misaligned with human values, leading to complete disregard for human safety.
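The "disproportionate weights in the fitness function" failure mode can be made concrete with a toy sketch (all names and numbers here are illustrative assumptions, not from the repository): a scoring function that underweights a safety term will select an unsafe candidate policy, while the same function with an adequate safety weight selects the safe one.

```python
# Toy illustration of reward mis-specification via a misweighted fitness
# function. The policies, scores, and weights are hypothetical.

def fitness(progress, safety_violations, safety_weight):
    """Score a candidate policy; higher is better."""
    return progress - safety_weight * safety_violations

# Two hypothetical candidate policies.
policies = {
    "fast_but_unsafe": {"progress": 100, "safety_violations": 8},
    "slow_but_safe": {"progress": 60, "safety_violations": 0},
}

def best(policies, safety_weight):
    """Return the name of the highest-scoring policy."""
    return max(policies, key=lambda name: fitness(
        policies[name]["progress"],
        policies[name]["safety_violations"],
        safety_weight,
    ))

# Disproportionately low safety weight: the unsafe policy wins (100 - 8 = 92 > 60).
print(best(policies, safety_weight=1))   # fast_but_unsafe
# Safety weighted adequately: the safe policy wins (60 > 100 - 80 = 20).
print(best(policies, safety_weight=10))  # slow_but_safe
```

The point of the sketch is that nothing in the optimizer is "buggy": the unsafe outcome follows directly from the relative weights the designer chose.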

Source: MIT AI Risk Repository (mit610)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit610

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Establish a comprehensive AI alignment protocol, such as value learning, to formally encode and prioritize human values and intended goals in the model's objective function. This is critical for preventing systemic goal misalignment, which can lead to the pursuit of unintended objectives, and for mitigating reward mis-specification, thereby reducing the emergence of undesirable instrumental strategies such as drives for independence or dominance.

2. Integrate advanced fine-tuning and validation techniques, specifically Reinforcement Learning from Human Feedback (RLHF) and deliberative alignment. These processes ensure that the AI learns by example, receives fine-grained corrective feedback on its outputs, and is explicitly trained on, and required to reason against, high-level anti-scheming specifications to prevent the concealment of misaligned behaviors.

3. Conduct exhaustive pre-deployment robustness testing and adversarial analysis, with particular focus on out-of-distribution inputs and vulnerability to adversarial examples, to identify and resolve simple logical or run-time bugs in the source code and to ensure the stability and reliability of the model in varied real-world operating environments.
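The out-of-distribution testing in step 3 can be sketched with a minimal distance-based detector (a simplified illustration, not the repository's method; the class name and 95th-percentile threshold are assumptions): inputs far from the training-data mean are flagged for review before the model's output is trusted.

```python
# Minimal pre-deployment OOD check: flag inputs whose distance from the
# training mean exceeds a threshold derived from the training data itself.
import numpy as np

class MeanDistanceOODDetector:
    def fit(self, X):
        """Record the training mean and set the threshold at the
        95th percentile of training-point distances from that mean."""
        self.mean_ = X.mean(axis=0)
        dists = np.linalg.norm(X - self.mean_, axis=1)
        self.threshold_ = np.percentile(dists, 95)
        return self

    def is_ood(self, x):
        """True if x lies farther from the training mean than the threshold."""
        return np.linalg.norm(x - self.mean_) > self.threshold_

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 4))  # in-distribution training data
det = MeanDistanceOODDetector().fit(train)

print(det.is_ood(np.zeros(4)))       # False: near the training mean
print(det.is_ood(np.full(4, 10.0)))  # True: far outside the training data
```

Real robustness testing would combine detectors like this with adversarial-example search and stress tests of the deployment environment; the sketch only shows the gating pattern.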