Discrimination, Exclusion and Toxicity
Social harms arising from a language model producing discriminatory, exclusionary, or toxic speech
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit231
Domain lineage
1. Discrimination & Toxicity
1.0 > Discrimination & Toxicity
Mitigation strategy
- Systematically curate and balance all training and fine-tuning datasets, using data augmentation, filtering, and sensitive-data masking to reduce the encoding of societal and cultural biases (a minimal filter-and-mask sketch follows this list)
- Integrate alignment-aware optimization, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), so that training itself instills a preference for respectful, non-discriminatory language (see the DPO loss sketch below)
- Implement robust post-deployment safety layers, including real-time toxicity classifiers, automated adversarial red-teaming, and controlled-generation guardrails, to detect and block biased or exclusionary outputs at inference time (see the guardrail sketch below)
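A minimal sketch of the first bullet's filter-and-mask pass, assuming a plain-text corpus. The blocklist, the e-mail regex, and every function name here are illustrative stand-ins; a production pipeline would rely on trained toxicity classifiers and a dedicated PII detector rather than keyword and regex matching.

```python
import re

# Illustrative stand-ins: real pipelines use trained toxicity classifiers
# and dedicated PII detectors, not keyword lists and a single regex.
BLOCKLIST = {"slur1", "slur2"}  # placeholder terms
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_sensitive(text: str) -> str:
    """Replace e-mail addresses with a mask token."""
    return EMAIL_RE.sub("[EMAIL]", text)

def is_acceptable(text: str) -> bool:
    """Drop examples containing any blocklisted term."""
    return set(text.lower().split()).isdisjoint(BLOCKLIST)

def curate(corpus):
    """Yield masked copies of the examples that pass the filter."""
    for doc in corpus:
        if is_acceptable(doc):
            yield mask_sensitive(doc)

if __name__ == "__main__":
    sample = ["Contact me at a.person@example.com", "slur1 appears here"]
    print(list(curate(sample)))  # -> ['Contact me at [EMAIL]']
```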
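For the second bullet, the following is a sketch of the DPO objective from Rafailov et al. (2023). It assumes the caller has already computed summed per-token log-probabilities for each preference pair under both the trainable policy and a frozen reference model; the argument names and the beta default are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument holds summed per-token log-probabilities of the preferred
    ("chosen") or dispreferred ("rejected") completion under the policy being
    trained or the frozen reference model.
    """
    # Implicit reward: scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin between preferred and dispreferred outputs.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```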
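And for the third bullet, a sketch of a post-hoc guardrail wrapping generation at inference time. Both callables and the 0.8 threshold are assumptions: `generate` stands in for the actual LLM call and `toxicity_score` for a real classifier returning P(toxic); the operating threshold would be tuned against red-teaming data to balance over-blocking against missed harms.

```python
from typing import Callable

REFUSAL = "I can't help with that request."
TOXICITY_THRESHOLD = 0.8  # assumed operating point; tune on red-teaming data

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     toxicity_score: Callable[[str], float]) -> str:
    """Generate a response, but block it if the classifier flags it as toxic."""
    draft = generate(prompt)
    if toxicity_score(draft) >= TOXICITY_THRESHOLD:
        return REFUSAL  # refuse rather than propagate a flagged output
    return draft

if __name__ == "__main__":
    # Dummy stand-ins, for demonstration only.
    echo_model = lambda p: f"Echo: {p}"
    keyword_scorer = lambda t: 1.0 if "hate" in t.lower() else 0.0
    print(guarded_generate("hello there", echo_model, keyword_scorer))
```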