Risk area 1: Discrimination, Hate speech and Exclusion
Speech can create a range of harms, such as promoting social stereotypes that perpetuate the derogatory representation or unfair treatment of marginalised groups [22], inciting hate or violence [57], causing profound offence [199], or reinforcing social norms that exclude or marginalise identities [15,58]. LMs that faithfully mirror harmful language present in the training data can reproduce these harms. Unfair treatment can also emerge from LMs that perform better for some social groups than others [18]. These risks have been widely observed and documented in LMs. Mitigation approaches include curating more inclusive and representative training data and fine-tuning models on datasets that counteract common stereotypes [171]. We now explore these risks in turn.
ENTITY: 2 - AI
INTENT: 2 - Unintentional
TIMING: 3 - Other
Risk ID: mit205
Domain lineage: 1. Discrimination & Toxicity > 1.2 Exposure to toxic content
Mitigation strategy
1. Curate and employ inclusive and representative training datasets to prevent the learning of social stereotypes and ensure equitable performance across all social groups, using techniques such as demographic analysis, reweighting, or synthetic data generation for underrepresented populations (a reweighting sketch follows this list).
2. Implement in-processing mitigation strategies, such as adversarial debiasing, fair representation learning, or objective-function regularization with fairness constraints, during model training or fine-tuning to algorithmically minimize the propagation of intrinsic and implicit biases (a fairness-penalty sketch follows).
3. Establish robust and continuous bias auditing throughout the model lifecycle, including post-deployment monitoring and human oversight, to detect and correct emergent misalignment, bias drift, or performance disparities across diverse demographic subgroups (a subgroup-audit sketch follows).
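The strategies above name concrete techniques; the minimal sketches below illustrate one possible form for each under stated assumptions, not prescribed implementations. For strategy 1, the reweighting sketch assumes each training example is annotated with a coarse demographic-group label; the group names and the inverse_frequency_weights helper are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(group_labels):
    """Assign each example a weight inversely proportional to the frequency of
    its demographic group, so underrepresented groups are not drowned out
    during training. Group labels here are illustrative placeholders."""
    counts = Counter(group_labels)
    total = len(group_labels)
    num_groups = len(counts)
    # Weight so that, in expectation, every group contributes equally.
    return [total / (num_groups * counts[g]) for g in group_labels]

# Hypothetical corpus annotated with a coarse demographic/group attribute.
groups = ["group_a", "group_a", "group_a", "group_b", "group_c", "group_a"]
weights = inverse_frequency_weights(groups)
print(list(zip(groups, [round(w, 2) for w in weights])))
# Underrepresented groups (group_b, group_c) receive larger weights, which can
# feed a weighted sampler or a weighted loss during training or fine-tuning.
```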
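For strategy 2's objective-function regularization with fairness constraints, a minimal PyTorch sketch combines a standard task loss with a demographic-parity style penalty on the gap in mean predicted probability between two groups. The binary group encoding, the positive-class convention, and the weight lam are assumptions for illustration; adversarial debiasing and fair representation learning would require additional machinery not shown here.

```python
import torch
import torch.nn.functional as F

def fairness_regularized_loss(logits, labels, group_ids, lam=0.1):
    """Task loss plus a simple fairness penalty: the absolute gap between the
    mean predicted 'positive' probability for two demographic groups (a
    demographic-parity style regularizer). Assumes both groups appear in the
    batch; the 0/1 group encoding and `lam` are illustrative choices."""
    task_loss = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=-1)[:, 1]           # P(positive class)
    gap = (probs[group_ids == 0].mean()
           - probs[group_ids == 1].mean()).abs()      # parity gap between groups
    return task_loss + lam * gap

# Toy batch: 8 examples, 2 classes, binary group attribute.
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = fairness_regularized_loss(logits, labels, group_ids)
loss.backward()   # gradients now trade off task accuracy against the parity gap
```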
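For strategy 3's continuous auditing and post-deployment monitoring, the sketch below runs a simple subgroup-disparity check over monitoring logs. The record schema, the harmful-output labels, and the disparity threshold are all assumed for illustration; groups flagged by such a check would be escalated to human reviewers.

```python
from collections import defaultdict

def audit_subgroup_disparity(records, threshold=0.05):
    """Post-deployment audit sketch: `records` is an iterable of
    (group, is_harmful_output) pairs drawn from monitoring logs (schema and
    threshold are assumptions). Returns groups whose harmful-output rate
    exceeds the overall rate by more than `threshold`."""
    totals, harms = defaultdict(int), defaultdict(int)
    for group, harmful in records:
        totals[group] += 1
        harms[group] += int(harmful)
    overall = sum(harms.values()) / sum(totals.values())
    flagged = {}
    for group in totals:
        rate = harms[group] / totals[group]
        if rate - overall > threshold:
            flagged[group] = {"rate": round(rate, 3), "overall": round(overall, 3)}
    return flagged

# Hypothetical monitoring logs for two demographic groups.
logs = [("group_a", False)] * 95 + [("group_a", True)] * 5 \
     + [("group_b", False)] * 80 + [("group_b", True)] * 20
print(audit_subgroup_disparity(logs))
# {'group_b': {'rate': 0.2, 'overall': 0.125}} -> disparity to investigate
```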