Harm caused by unaligned competent systems
How do we ensure AI acts according to our values? Equivalently, how do we prevent poorly-understood AI systems from advancing goals we do not endorse? Whereas HP#2 concerns the prevention of harm caused by incompetent systems, HP#3 seeks to align competent AIs with humans, through methods which ensure their behavior is compatible with the user’s intentions.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit880
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement a Defense-in-Depth Architecture: Employ multiple, redundant, and uncorrelated alignment techniques across the system lifecycle (design, training, and deployment) to ensure safety is maintained even if individual controls fail, acknowledging the inherent failure modes of any single technique. 2. Develop Mechanisms for Detecting Strategic Misalignment: Prioritize research and implementation of red-teaming and evaluation methods—such as cross-context testing, perturbation robustness, and interpretability techniques (e.g., deciphering internal reasoning)—specifically designed to uncover deceptive alignment and sandbagging behaviors in highly capable systems. 3. Establish Robust and Scalable Alignment Training: Design comprehensive training protocols and oversight mechanisms to achieve both Outer Alignment (specifying human values and intent) and Inner Alignment (ensuring the system adopts the specification robustly), leveraging the AI's reasoning capabilities in a dialogue-based approach to ensure compatibility with human intentions and ethical norms.