1. Discrimination & Toxicity | 2 - Post-deployment

Violation of social norms

Second, because LLMs are trained on internet text data, there is also a risk that model weights encode functions which, if deployed in particular contexts, would violate the social norms of that context. Following the principles of contextual integrity, models may deviate from information-sharing norms as a result of their training. Overcoming this challenge requires two types of infrastructure: one for keeping track of social norms in context, and another for ensuring that models adhere to them. Keeping track of which social norms are presently at play is an active research area. Surfacing value misalignments between a model’s behaviour and social norms is a daunting task, and it too is the subject of active research (see Chapter 5).

Source: MIT AI Risk Repository (mit416)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit416

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Reinforcement Learning from Human Feedback (RLHF): To align the model's generated behavior directly with established human values, social norms, and contextual expectations by training a reward model based on human preference data, thereby steering the LLM toward contextually appropriate and safe output generation.

2. Fine-tuning and Data Filtering with Ethical Curation: To mitigate the encoding of norms-violating functions stemming from uncurated training data, this involves: (a) filtering pre-training data to exclude harmful or non-normative content; and (b) subsequent fine-tuning on ethically reviewed, high-quality datasets to adapt the model to specific domain ethics and high-integrity knowledge bases, reducing the predisposition for biased or unsafe content.

3. Implementation of Dynamic and Context-Aware Guardrails: To establish robust post-deployment behavioral controls (such as rule-based content filters, moderation APIs, and domain-specific policies) that dynamically screen model inputs and outputs to enforce adherence to cultural, ethical, and organizational norms relevant to the operational context, in line with the principles of contextual integrity.
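
The context-aware guardrail idea in the third strategy can be illustrated with a minimal rule-based sketch. Everything named here is an assumption made for illustration and does not come from the repository entry: the `ContextPolicy` class, the `POLICIES` table, the two example contexts, the regex rules, and the `guarded_generate` wrapper are all hypothetical. A real deployment would likely combine such rules with a moderation service and policies maintained by the deploying organization.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ContextPolicy:
    """Hypothetical per-context policy: regex patterns that must not appear."""
    name: str
    blocked_patterns: list[str] = field(default_factory=list)

    def violations(self, text: str) -> list[str]:
        # Return every blocked pattern that matches the text (case-insensitive).
        return [p for p in self.blocked_patterns
                if re.search(p, text, re.IGNORECASE)]


# Illustrative pattern for identifiers shaped like US Social Security numbers.
SSN_LIKE = r"\b\d{3}-\d{2}-\d{4}\b"

# Illustrative policies only: the same text can be acceptable in one
# context and a norm violation in another.
POLICIES = {
    "medical_support": ContextPolicy("medical_support", [SSN_LIKE]),
    "childrens_education": ContextPolicy(
        "childrens_education", [SSN_LIKE, r"\bgambling\b"]
    ),
}

REFUSAL = "[withheld: response did not meet the policy for this context]"


def guarded_generate(prompt: str, context: str, model) -> str:
    """Screen the input, call the model, then screen the output.

    `model` is any callable mapping a prompt string to an output string.
    An unknown context raises KeyError, so nothing runs without a policy.
    """
    policy = POLICIES[context]
    if policy.violations(prompt):
        return REFUSAL
    output = model(prompt)
    if policy.violations(output):
        return REFUSAL
    return output
```

With a stand-in model that always answers "You could try gambling online.", the call `guarded_generate("What is fun?", "childrens_education", model)` returns the refusal string, while the same call with the `"medical_support"` context returns the model output unchanged. Keying the rules on the deployment context, not on the text alone, is what ties the filter to the contextual-integrity framing; the regex rules are a placeholder for whatever norm-tracking infrastructure is actually available.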