1. Discrimination & Toxicity

Social Norm

LLMs are expected to reflect social values: avoiding offensive language toward specific groups of users, treating topics that can create instability with sensitivity, and responding sympathetically when users seek emotional support.

Source: MIT AI Risk Repository (mit501)

ENTITY: 2 - AI

INTENT: 3 - Other

TIMING: 2 - Post-deployment

Risk ID: mit501

Domain lineage: 1. Discrimination & Toxicity (156 mapped risks) > 1.2 Exposure to toxic content

Mitigation strategy

1. Real-time output filtering and validation. Implement a robust, multi-layered guardrail system that uses NLP classifiers and deep learning models to filter and sanitize outputs at runtime, detecting and blocking explicit, offensive, or harmful content before it reaches the user. This is an essential last line of defense against generating toxic content (a minimal filtering sketch follows this list).

2. Safety alignment and preference optimization. Fine-tune the model with techniques such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to instill a strong preference for respectful, non-discriminatory, and contextually appropriate language. This training aligns the model with ethical social values and reduces unintentional generation of biased or toxic outputs, particularly on sensitive topics and in emotional-support contexts (see the DPO loss sketch below).

3. Bias auditing and mitigation frameworks. Establish an institutional governance framework mandating regular, systematic audits of LLM behavior to identify, quantify, and address algorithmic biases that lead to unequal outcomes or perpetuate societal stereotypes. Mitigation must be continuous, combining algorithmic fairness techniques with verification that training data is diverse and representative, to ensure equitable and reliable system performance (see the audit sketch below).
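As a concrete illustration of item 1, the sketch below gates a candidate response behind a toxicity score before it is shown to the user. The `score_toxicity` stand-in, the 0.5 threshold, and the fallback message are illustrative assumptions, not part of this repository entry; in production the scorer would be a trained classifier, not a keyword list.

```python
# Minimal runtime output-filtering sketch. The keyword scorer is a
# placeholder for a real toxicity classifier; the threshold and fallback
# text are assumed values to be tuned per deployment.
from dataclasses import dataclass

TOXICITY_THRESHOLD = 0.5  # assumed operating point
BLOCKLIST = {"offensive_term_a", "offensive_term_b"}  # placeholder terms

@dataclass
class GuardrailResult:
    allowed: bool
    score: float
    text: str

def score_toxicity(text: str) -> float:
    """Stand-in scorer: a production system would call an NLP model here."""
    words = set(text.lower().split())
    return 1.0 if words & BLOCKLIST else 0.0

def filter_output(candidate: str) -> GuardrailResult:
    """Check a model response before user exposure; block if too toxic."""
    score = score_toxicity(candidate)
    if score >= TOXICITY_THRESHOLD:
        # Never expose the flagged text; return a safe refusal instead.
        return GuardrailResult(False, score, "I can't respond that way.")
    return GuardrailResult(True, score, candidate)
```

A multi-layered deployment would chain several such checks (toxicity, targeted-group slurs, self-harm content) and log blocked outputs for audit.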
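For item 2, the loss below is the standard DPO objective (Rafailov et al., 2023), shown as a PyTorch sketch. The tensor arguments and the β = 0.1 default are assumptions about how a training loop batches preference pairs; they are not specified by this entry.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each tensor holds per-example summed log-probabilities of a response
    under the trainable policy or the frozen reference model; beta limits
    how far the policy may drift from the reference.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the preferred (e.g., respectful) response above the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

For toxicity mitigation, the "chosen" responses in the preference data would be the respectful, non-discriminatory completions and the "rejected" ones the toxic alternatives.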
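For item 3, a bias audit can start by comparing a metric (here, the mean toxicity score of model outputs) across demographic groups on a fixed evaluation set. The record format and the gap statistic below are illustrative choices, not a prescribed methodology.

```python
from collections import defaultdict

def audit_by_group(records):
    """records: (group_label, toxicity_score) pairs, e.g. classifier scores
    over model outputs for prompts mentioning each demographic group."""
    scores = defaultdict(list)
    for group, score in records:
        scores[group].append(score)
    rates = {g: sum(s) / len(s) for g, s in scores.items()}
    gap = max(rates.values()) - min(rates.values())  # simple disparity stat
    return rates, gap

# Toy data only; a real audit would score thousands of outputs per group.
rates, gap = audit_by_group([
    ("group_a", 0.10), ("group_a", 0.20),
    ("group_b", 0.40), ("group_b", 0.30),
])
print(rates, gap)  # ~{'group_a': 0.15, 'group_b': 0.35}, gap ~0.20
```

A governance framework would track these disparity statistics across releases and trigger retraining or data rebalancing when a gap exceeds a pre-agreed tolerance.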

ADDITIONAL EVIDENCE

We caution readers and practitioners that some social values are debatable, and even popular opinion does not warrant promoting them (e.g., certain political opinions). In this section, we focus on values that people would generally agree serve the social good, based on our reading of the literature and public discussions. For the controversial ones, we refer readers to our discussion of preference bias (Section 6.3); our position is that LLMs should remain neutral when prompted with these questions.