7. AI System Safety, Failures, & Limitations3 - Other

Harm caused by unaligned competent systems

How do we ensure AI acts according to our values? Equivalently, how do we prevent poorly-understood AI systems from advancing goals we do not endorse? Whereas HP#2 concerns the prevention of harm caused by incompetent systems, HP#3 seeks to align competent AIs with humans, through methods which ensure their behavior is compatible with the user’s intentions.

Source: MIT AI Risk Repositorymit880

ENTITY

3 - Other

INTENT

3 - Other

TIMING

3 - Other

Risk ID

mit880

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implement a Defense-in-Depth Architecture: Employ multiple, redundant, and uncorrelated alignment techniques across the system lifecycle (design, training, and deployment) to ensure safety is maintained even if individual controls fail, acknowledging the inherent failure modes of any single technique. 2. Develop Mechanisms for Detecting Strategic Misalignment: Prioritize research and implementation of red-teaming and evaluation methods—such as cross-context testing, perturbation robustness, and interpretability techniques (e.g., deciphering internal reasoning)—specifically designed to uncover deceptive alignment and sandbagging behaviors in highly capable systems. 3. Establish Robust and Scalable Alignment Training: Design comprehensive training protocols and oversight mechanisms to achieve both Outer Alignment (specifying human values and intent) and Inner Alignment (ensuring the system adopts the specification robustly), leveraging the AI's reasoning capabilities in a dialogue-based approach to ensure compatibility with human intentions and ethical norms.