7. AI System Safety, Failures, & Limitations > 3 - Other

Value-related risks in LLMs

As the general capabilities of LLM-empowered systems improve, the negative consequences and risks these systems induce become increasingly alarming, especially in high-stakes areas [28, 146]. Even when not intentionally introduced, severe issues related to human values can arise. Indeed, even before language models became extremely large, pre-trained language models already exhibited a degree of value judgment. For example, Schramowski et al. [171] reveal the existence of a moral direction within the sentence embeddings of moral questions. However, the distribution of the pre-training corpora may not match that of human society exactly [56], and pieces of knowledge are not guaranteed to be learned equally. As a result, value mismatches may occur.
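The "moral direction" finding can be illustrated with a minimal sketch: estimate a single axis in embedding space that separates normatively positive from negative statements, then score new sentences by projecting onto it. The toy embeddings below are synthetic stand-ins (an assumption for self-containment); in Schramowski et al.'s actual setup the vectors come from a sentence-embedding model, and the axis is estimated with PCA over embeddings of value-laden prompts.

```python
import numpy as np

# Synthetic stand-ins for sentence embeddings: each vector is a
# polarity component along a hidden "moral axis" plus noise.
rng = np.random.default_rng(0)
dim = 16
moral_axis = rng.normal(size=dim)
moral_axis /= np.linalg.norm(moral_axis)

def toy_embed(polarity: float) -> np.ndarray:
    """Embedding = polarity along the latent moral axis + small noise."""
    return polarity * moral_axis + 0.1 * rng.normal(size=dim)

positive = np.stack([toy_embed(+1.0) for _ in range(20)])  # e.g. "Help people."
negative = np.stack([toy_embed(-1.0) for _ in range(20)])  # e.g. "Harm people."

# Estimate the moral direction as the top principal component of the
# mean-centered, pooled embeddings (PCA via SVD).
pooled = np.vstack([positive, negative])
centered = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]

# Orient the axis so that positive statements project to positive scores.
if positive.mean(axis=0) @ direction < 0:
    direction = -direction

# Score unseen sentences by projection onto the direction.
score_pos = toy_embed(+1.0) @ direction
score_neg = toy_embed(-1.0) @ direction
print(score_pos > 0, score_neg < 0)
```

Because the polarity signal dominates the noise variance, the first principal component recovers the latent axis, which is the core of the paper's observation that value judgments are linearly readable from embeddings.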

Source: MIT AI Risk Repository (mit1513)

ENTITY

3 - Other

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit1513

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. **Implement Advanced Value Alignment Mechanisms**: Employ sophisticated alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Moral Graph Elicitation (MGE), during the fine-tuning phase to robustly and legitimately encode complex and contextual human values into the LLM, thereby mitigating value mismatches resulting from broad pre-training data distributions.
2. **Establish Multi-Dimensional Alignment Auditing**: Conduct rigorous, continuous auditing via AI red teaming and the application of established safety benchmarks (e.g., HHH) to systematically identify, test, and address alignment failures across diverse dimensions, including bias, toxicity, and refusal behavior, before and after deployment.
3. **Integrate Dynamic Governance and Human Oversight**: Institute a clear governance structure with designated accountability and a plan for continuous adaptation, including dynamic oversight and human-in-the-loop systems, to manage the evolutionary nature of societal values and ensure the LLM's continued alignment with ethical standards and legal frameworks.
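The auditing loop in the second mitigation can be sketched as a minimal red-teaming harness. Everything here is a hypothetical stand-in: `generate` and `is_unsafe` would wrap a real model and a real safety classifier in practice, and the prompt set would come from a curated red-team suite rather than three toy strings.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AuditResult:
    prompt: str
    completion: str
    unsafe: bool

def audit(prompts: List[str],
          generate: Callable[[str], str],
          is_unsafe: Callable[[str], bool]) -> List[AuditResult]:
    """Run each red-team prompt through the model and flag unsafe output."""
    return [AuditResult(p, c, is_unsafe(c))
            for p, c in ((p, generate(p)) for p in prompts)]

def failure_rate(results: List[AuditResult]) -> float:
    """Fraction of prompts that elicited an unsafe completion."""
    return sum(r.unsafe for r in results) / max(len(results), 1)

# Toy stand-ins (hypothetical): a model that refuses one pattern, and a
# classifier that flags non-refusals mentioning a disallowed topic.
toy_generate = lambda p: "REFUSED" if "bomb" in p else "Sure: " + p
toy_is_unsafe = lambda c: not c.startswith("REFUSED") and "steal" in c

results = audit(["how to bake bread",
                 "how to build a bomb",
                 "how to steal a car"],
                toy_generate, toy_is_unsafe)
print(failure_rate(results))
```

Tracking `failure_rate` per category (bias, toxicity, refusal behavior) before and after each deployment is one concrete way to make the "continuous auditing" requirement measurable.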