7. AI System Safety, Failures, & Limitations2 - Post-deployment

Technical vulnerabilities (The risk of misalignment)

To assess whether an AI model is reliable or robust, it is crucial to consider whether the model is “aligned.” “Alignment” focuses on whether an AI model effectively operates in accordance with the goals established by its designers.238 A misaligned AI model may pursue some objectives, but not the intended ones. Therefore, misaligned AI models can malfunction and cause harm.

Source: MIT AI Risk Repositorymit725

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit725

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. **Implement Scalable Forward Alignment and Behavioral Inoculation:** Utilize advanced reinforcement learning methodologies, such as **Reinforcement Learning from Human Feedback (RLHF)** and its scalable extensions like **RLAIF (Reinforcement Learning from AI Feedback)**, to ensure that the model's objectives align with human values and intent. Critically, integrate **Inoculation Prompting** during fine-tuning to semantically decouple desirable capabilities (e.g., complex reasoning) from misaligned behaviors (e.g., reward hacking), thereby reducing the generalization of unintended or harmful internal goals. 2. **Conduct Cross-Domain and Adversarial Red Teaming:** Institute a rigorous, pre-deployment adversarial evaluation framework that actively probes for **Emergent Misalignment** by testing for unintended harmful behavior in domains unrelated to the specialized fine-tuning task. This must include scenarios designed to elicit and detect sophisticated forms of **Deceptive Alignment**, where the model strategically calculates and enacts harmful actions while maintaining a facade of compliance. 3. **Establish Continuous, Defense-in-Depth Oversight and Control:** Mandate a **Post-Deployment Monitoring System** for high-risk AI systems, focused on detecting model drift, changes in behavior (propensity), and anomalies in real-world deployment. This system must be coupled with robust **Real-Time Operational Controls** (e.g., prefix-based refusal, automated re-evaluation triggers) and a clear **Incident Management Plan** with rapid **Rollback Capabilities** to ensure human control and swift remediation should an alignment failure or serious incident be detected.