1. Discrimination & Toxicity

Discrimination, Exclusion and Toxicity

Social harms that arise from the language model producing discriminatory or exclusionary speech

Source: MIT AI Risk Repository (mit231)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit231

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.0 > Discrimination & Toxicity

Mitigation strategy

- Systematically curate and balance all training and fine-tuning datasets, employing data augmentation, filtering, and sensitive-data masking techniques to mitigate the encoding of societal and cultural biases
- Integrate alignment-aware optimization, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), to cultivate a preference for respectful and non-discriminatory language generation during model training
- Implement robust post-deployment safety layers, including real-time toxicity classifiers, automated adversarial red-teaming, and controlled text generation guardrails, to detect and prevent the propagation of biased or exclusionary outputs at inference time
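The inference-time guardrail described in the last mitigation can be sketched as a wrapper that scores each model output before release. This is a minimal illustrative example, not the repository's implementation: the keyword heuristic below is a hypothetical stand-in for a real learned toxicity classifier, and all names (`BLOCKLIST`, `toxicity_score`, `guarded_generate`) are assumptions introduced for the sketch.

```python
# Minimal sketch of an inference-time toxicity guardrail.
# The scoring function is a toy keyword heuristic standing in for a
# real toxicity classifier (e.g. a fine-tuned model); in production the
# score would come from a model, not a word list.

BLOCKLIST = {"idiot", "stupid", "hate"}  # placeholder lexicon for illustration
THRESHOLD = 0.5  # fraction of flagged tokens that triggers suppression


def toxicity_score(text: str) -> float:
    """Toy stand-in for a learned classifier: fraction of blocklisted
    tokens among all tokens in the text."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,!?") in BLOCKLIST for t in tokens) / len(tokens)


def guarded_generate(model_output: str) -> str:
    """Return the model output unchanged, or a refusal message if the
    output trips the toxicity gate."""
    if toxicity_score(model_output) >= THRESHOLD:
        return "[output withheld: flagged by toxicity filter]"
    return model_output
```

In a real deployment the threshold, classifier, and refusal behavior would be tuned jointly, since an overly aggressive filter can itself become exclusionary by suppressing benign speech about marginalized groups.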