Stereotype Bias
LLMs must not exhibit or amplify stereotypes in generated text. Pretrained LLMs tend to pick up stereotype biases present in crowdsourced training data and further amplify them.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit489
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
- Implement systematic data curation and augmentation techniques, such as Counterfactual Data Augmentation (CDA), to actively rebalance and neutralize demographic-attribute correlations in the training corpora. This foundational step keeps the model's knowledge base from inheriting or amplifying societal stereotypes (see the CDA sketch after this list).
- Integrate intra-model and inference-time debiasing methods, such as adversarial debiasing or projection-based techniques, to decouple stereotype associations within the model's internal representations (e.g., embeddings or MLP activations). This directly reduces the model's propensity to generate biased outputs (see the projection sketch below).
- Establish a robust post-deployment governance framework that includes continuous auditing against established fairness benchmarks (e.g., Regard, StereoSet) and closed-loop Reinforcement Learning from Human Feedback (RLHF). This enables prompt detection of emergent biases and yields actionable data for iterative model refinement and safety tuning (see the audit sketch below).
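As a rough illustration of CDA, the Python sketch below pairs each training sentence with a counterfactual copy in which gendered terms are swapped. The SWAP_PAIRS lexicon and the toy corpus are illustrative assumptions; a production pipeline would use a curated attribute lexicon and handle grammatical agreement (e.g., possessive "her" vs. "his"), which this minimal version does not.

```python
import re

# Minimal counterfactual data augmentation (CDA) sketch: for each training
# sentence, emit a counterfactual copy with demographic terms swapped.
# SWAP_PAIRS is an illustrative assumption, not an exhaustive lexicon.
SWAP_PAIRS = {"he": "she", "she": "he", "him": "her", "her": "him",
              "man": "woman", "woman": "man",
              "father": "mother", "mother": "father"}

def counterfactual(sentence: str) -> str:
    """Return the sentence with each gendered term replaced by its pair."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAP_PAIRS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(SWAP_PAIRS) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

corpus = ["The doctor said he would call the nurse."]
augmented = corpus + [counterfactual(s) for s in corpus]
# `augmented` now pairs each sentence with its counterfactual, rebalancing
# demographic-attribute co-occurrence counts in the training data.
```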
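For the projection-based debiasing item, a minimal sketch under the assumption that we have access to the model's embedding vectors (random stand-ins here): estimate a one-dimensional bias direction from paired term differences, then subtract its component from each representation.

```python
import numpy as np

# Projection-based debiasing sketch: estimate a "gender direction" from
# embedding differences between paired terms, then remove its component
# from every embedding. The embeddings below are random stand-ins; in
# practice they would come from the model under audit.
rng = np.random.default_rng(0)
dim = 16
emb = {w: rng.normal(size=dim)
       for w in ["he", "she", "man", "woman", "engineer"]}

# Bias direction: mean of difference vectors between paired terms.
pairs = [("he", "she"), ("man", "woman")]
diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
g = diffs.mean(axis=0)
g /= np.linalg.norm(g)

def debias(v: np.ndarray) -> np.ndarray:
    """Remove the component of v lying along the bias direction g."""
    return v - np.dot(v, g) * g

neutral = debias(emb["engineer"])
assert abs(np.dot(neutral, g)) < 1e-9  # no residual bias component
```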
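For the continuous-auditing item, a StereoSet-style stereotype score can be tracked over time: the fraction of sentence pairs where the model prefers the stereotypical variant, which should sit near 0.5 for an unbiased model. The sketch below is schematic; stereotype_score, toy_logprob, and the example pairs are hypothetical names, and a real audit would score full benchmark suites with the deployed model's log-likelihoods.

```python
from typing import Callable

def stereotype_score(
    pairs: list[tuple[str, str]],
    logprob: Callable[[str], float],
) -> float:
    """Fraction of pairs where the stereotypical variant scores higher
    under `logprob`; an unbiased model should land near 0.5."""
    prefer = sum(logprob(stereo) > logprob(anti) for stereo, anti in pairs)
    return prefer / len(pairs)

# Toy stand-in for the deployed model's scorer, used only so the sketch
# runs; in production this would call the model's log-likelihood endpoint.
def toy_logprob(sentence: str) -> float:
    return -float(len(sentence))

audit_pairs = [
    ("The nurse said she was tired.", "The nurse said he was tired."),
    ("The engineer fixed his code.", "The engineer fixed her code."),
]
print(stereotype_score(audit_pairs, toy_logprob))
# Run on a schedule post-deployment; alert if the score drifts from 0.5.
```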
ADDITIONAL EVIDENCE
Stereotypes reflect general expectations, often misleading, about members of particular social groups. They are typically seen as hostile prejudice and as a basis for discrimination against out-group members.