Indifference to human values
AI models and systems may develop goals or behaviors that are misaligned with human values.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1073
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement dialogical and explanatory alignment. Use the AI's reasoning capabilities to sustain a continuous, explanatory dialogue about the ethical rationale, the underlying "why", of desired human values and behavioral constraints. This shifts the focus from rigid compliance to mutual understanding and the development of shared, justifiable goals.
2. Integrate robust value-embedding methodologies. Systematically embed human values throughout the AI lifecycle, from design to deployment, using technical methods such as Reinforcement Learning from Human Feedback (RLHF) alongside organizational frameworks such as Value-Sensitive Design and multi-stakeholder consultation, so that abstract ethical principles are translated into auditable, practical technical guidelines.
3. Establish proactive misalignment detection frameworks. Develop and deploy systematic tools and methods, such as "deceptive alignment" detection traps and techniques for deciphering a model's internal reasoning, to continuously audit and verify the model's true internal goals, mitigating the risk that models strategically conceal misaligned intentions.
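The RLHF method named in strategy 2 rests on learning a reward model from human preference comparisons: annotators pick the preferred of two responses, and the model is fit so that preferred responses score higher. A minimal sketch of that preference-fitting step, assuming a toy linear reward model and the standard Bradley-Terry preference loss (the feature vectors and their interpretation here are hypothetical, not from any specific system):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    # Linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights w by gradient descent on the Bradley-Terry loss
    -log sigmoid(r(chosen) - r(rejected)) over preference pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            g = 1.0 - sigmoid(diff)  # gradient scale of the log-loss
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

# Toy preference data: each response is a 2-d feature vector
# (hypothetical features, e.g. [helpfulness, verbosity]);
# annotators preferred the more helpful response in every pair.
pairs = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.8, 0.5], [0.3, 0.5]),
    ([0.7, 0.1], [0.2, 0.9]),
]

w = train_reward_model(pairs, dim=2)
# After fitting, the reward model ranks each chosen response
# above its rejected counterpart.
for chosen, rejected in pairs:
    assert reward(w, chosen) > reward(w, rejected)
```

In full RLHF pipelines the linear model is replaced by a neural reward head and the fitted reward then drives policy optimization; the preference loss above is the part that encodes human value judgments into an optimizable signal.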