Social stereotypes and unfair discrimination
The reproduction of harmful stereotypes is well-documented in models that represent natural language [32]. Large-scale LMs are trained on text sources, such as digitised books and text on the internet. As a result, the LMs learn demeaning language and stereotypes about groups who are frequently marginalised.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit206
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Employ pre-processing techniques, such as Counterfactual Data Augmentation and targeted data enrichment, to systematically augment training data with counter-stereotypical examples and neutralize spurious correlations, thereby reducing the model's capacity to learn and reproduce harmful social biases.
2. Implement continuous, rigorous bias auditing and measurement across diverse demographic groups using established fairness benchmarks to quantify systemic disparities, followed by the application of in-processing techniques like model fine-tuning or internal mechanism adjustments (e.g., pruning or projection-based corrections) to align the model with fairness desiderata.
3. Deploy runtime bias mitigation strategies, such as Large Language Model-based self-reflection or multi-agent cooperative filtering, to provide real-time, intra-processing adjustment or post-processing correction of model outputs that exhibit discriminatory language or reinforce social stereotypes.
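As an illustration of the Counterfactual Data Augmentation mentioned in the first mitigation step, the following is a minimal sketch rather than a reference implementation. It assumes a small, hand-written list of gendered word pairs (SWAP_PAIRS is hypothetical and far from exhaustive) and simple whole-word substitution. Ambiguous words such as "her" (which can map to "his" or "him") are deliberately left out, because resolving them would require part-of-speech information that this sketch does not model.

```python
import re

# Hypothetical, deliberately tiny swap list; a real list would be curated
# and would cover many more terms and demographic axes.
SWAP_PAIRS = [
    ("he", "she"),
    ("man", "woman"),
    ("father", "mother"),
    ("son", "daughter"),
]


def build_swap_map(pairs):
    """Return a dict that maps each word to its counterpart in both directions."""
    swap = {}
    for a, b in pairs:
        swap[a] = b
        swap[b] = a
    return swap


SWAP_MAP = build_swap_map(SWAP_PAIRS)

# Match only whole words so that "son" does not fire inside "person".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in SWAP_MAP) + r")\b",
    re.IGNORECASE,
)


def counterfactual(sentence):
    """Swap every listed word for its counterpart, preserving capitalisation."""

    def repl(match):
        word = match.group(0)
        swapped = SWAP_MAP[word.lower()]
        if word.isupper() and len(word) > 1:
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped

    return _PATTERN.sub(repl, sentence)


def augment(corpus):
    """Return the corpus with a counterfactual copy added after each changed sentence."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented
```

Calling augment(["The man smiled.", "It rained."]) keeps both original sentences and adds "The woman smiled." after the first. Sentences without any listed term are left unduplicated, so the augmentation only rebalances the examples that carry the targeted attribute.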
ADDITIONAL EVIDENCE
Training data more generally reflect historical patterns of systemic injustice when they are gathered from contexts in which inequality is the status quo [76]. Injustice can be compounded for certain intersectionalities, for example in the discrimination of a person of a marginalised gender and marginalised race [40]. It can be aggravated if a model is opaque or unexplained, making it harder for victims to seek recourse [186]. The axes along which unfair bias is encoded in the LM can be rooted in localised social hierarchies such as the Hindu caste system, making it harder to anticipate harmful social stereotypes across contexts [163].