Indifference to human values
AI models and systems may develop goals or behaviors that are misaligned with human values.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1073
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement dialogical and explanatory alignment. Use the AI's reasoning capabilities to sustain a continuous, explanatory dialogue about the ethical rationale, the underlying "why", of desired human values and behavioral constraints. This shifts the focus from rigid compliance to mutual understanding and the development of shared, justifiable goals.
2. Integrate robust value-embedding methodologies. Systematically embed human values throughout the AI lifecycle, from design to deployment, using technical methods such as Reinforcement Learning from Human Feedback (RLHF) alongside organizational frameworks such as Value-Sensitive Design and multi-stakeholder consultation, so that abstract ethical principles are translated into auditable, practical technical guidelines.
3. Establish proactive misalignment detection frameworks. Develop and deploy systematic tools and methods, such as "deceptive alignment" detection traps and techniques for deciphering a model's internal reasoning, to continuously audit and verify the model's true internal goals, mitigating the risk that models strategically conceal misaligned intentions.
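The RLHF method named in strategy 2 rests on learning a reward model from human preference comparisons: annotators pick the preferred of two responses, and the model is fit so that preferred responses score higher. A minimal sketch of that preference-fitting step, assuming a toy linear reward model and the standard Bradley-Terry preference loss (the feature vectors and their interpretation here are hypothetical, not from any specific system):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    # Linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights w by gradient descent on the Bradley-Terry loss
    -log sigmoid(r(chosen) - r(rejected)) over preference pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            g = 1.0 - sigmoid(diff)  # gradient scale of the log-loss
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

# Toy preference data: each response is a 2-d feature vector
# (hypothetical features, e.g. [helpfulness, verbosity]);
# annotators preferred the more helpful response in every pair.
pairs = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.8, 0.5], [0.3, 0.5]),
    ([0.7, 0.1], [0.2, 0.9]),
]

w = train_reward_model(pairs, dim=2)
# After fitting, the reward model ranks each chosen response
# above its rejected counterpart.
for chosen, rejected in pairs:
    assert reward(w, chosen) > reward(w, rejected)
```

In full RLHF pipelines the linear model is replaced by a neural reward head and the fitted reward then drives policy optimization; the preference loss above is the part that encodes human value judgments into an optimizable signal.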