General Evaluations (Inaccurate measurement of model encoded human values)
There is a lack of robust frameworks for evaluating whether the outputs of AI systems genuinely conform to human values, as opposed to whether the systems have merely learned to produce outputs that are partially correlated with those values (i.e., mimicking) [13]. Additionally, the outputs of AI models often do not perfectly reflect the representation of human values the model has learned, and it is not known how these values evolve and transition across different stages of model training and deployment. Such evaluations may be especially challenging for LLMs that adopt different personas with different behavioral patterns and therefore do not consistently conform to any fixed set of human values.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit1113
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement a research and development mandate for **Deliberative Ethical Reasoning Frameworks** (System-2 alignment) and **Pluralistic Distributional Alignment Metrics** (e.g., L1-norm or composite ethical benchmarks) to rigorously evaluate AI system outputs against the nuanced and diverse spectrum of human values, explicitly distinguishing robust conformance from superficial mimicry or partial correlation.
2. Enforce **Value-Sensitive Design (VSD)** and advanced fine-tuning techniques, such as Reinforcement Learning from Human Feedback (RLHF) with explicit, intrinsic reward functions, to embed defined ethical principles and **"red line" non-negotiable values** into the model from the initial design phase, making value alignment a continuous, integrated process throughout the development lifecycle.
3. Establish a comprehensive, multi-stakeholder **AI Governance and Auditing Lifecycle** that includes mandatory, independent, and periodic **transparency and fairness audits** and **systematic human studies** post-deployment, specifically to monitor the value consistency and ethical conformity of deployed systems, particularly large language models that adopt varied personas or behavioral patterns.
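To make the L1-norm alignment metric from strategy 1 concrete, here is a minimal sketch. It assumes a survey-style evaluation in which human responses to a value-laden prompt form a distribution over discrete answer options, and the model's sampled answers form a comparable distribution; the option names and probabilities below are purely illustrative, not from any real benchmark.

```python
# Hedged sketch of a pluralistic distributional alignment metric:
# the L1 distance between a model's answer distribution and a target
# human answer distribution over the same discrete option set.
# A score of 0.0 means identical distributions; 2.0 is maximal divergence.

def l1_alignment(model_dist, human_dist):
    """L1 distance between two probability distributions (dicts option -> prob)."""
    options = set(model_dist) | set(human_dist)
    return sum(abs(model_dist.get(o, 0.0) - human_dist.get(o, 0.0))
               for o in options)

def mean_l1_alignment(pairs):
    """Average L1 distance across a set of (model_dist, human_dist) prompt pairs."""
    return sum(l1_alignment(m, h) for m, h in pairs) / len(pairs)

# Illustrative example: one value-laden question with three answer options.
human = {"agree": 0.5, "neutral": 0.3, "disagree": 0.2}
model = {"agree": 0.8, "neutral": 0.1, "disagree": 0.1}
print(round(l1_alignment(model, human), 3))  # 0.6
```

A low score here indicates the model matches the *spread* of human opinion rather than collapsing onto a single majority answer, which is the distinction between distributional alignment and mere mimicry that the mitigation targets.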