Stereotype Bias
LLMs must not exhibit or amplify stereotypes in generated text. Pretrained LLMs tend to pick up stereotype biases present in crowdsourced training data and further amplify them.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit489
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
- Implement systematic data curation and augmentation techniques, such as Counterfactual Data Augmentation (CDA), to actively rebalance and neutralize demographic-attribute correlations in the training corpora. This foundational step keeps the model's knowledge base from inheriting or amplifying societal stereotypes (see the CDA sketch after this list).
- Integrate intra-model and inference-time debiasing methods, such as adversarial debiasing or projection-based techniques, to decouple stereotype associations within the model's internal representations (e.g., embeddings or MLP activations). This directly reduces the model's propensity to generate biased outputs (see the projection sketch below).
- Establish a robust post-deployment governance framework that includes continuous auditing against established fairness benchmarks (e.g., Regard, StereoSet) and closed-loop Reinforcement Learning from Human Feedback (RLHF). This enables prompt detection of emergent biases and yields actionable data for iterative model refinement and safety tuning (see the audit sketch below).
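As a rough illustration of CDA, the Python sketch below pairs each training sentence with a counterfactual copy in which gendered terms are swapped. The SWAP_PAIRS lexicon and the toy corpus are illustrative assumptions; a production pipeline would use a curated attribute lexicon and handle grammatical agreement (e.g., possessive "her" vs. "his"), which this minimal version does not.

```python
import re

# Minimal counterfactual data augmentation (CDA) sketch: for each training
# sentence, emit a counterfactual copy with demographic terms swapped.
# SWAP_PAIRS is an illustrative assumption, not an exhaustive lexicon.
SWAP_PAIRS = {"he": "she", "she": "he", "him": "her", "her": "him",
              "man": "woman", "woman": "man",
              "father": "mother", "mother": "father"}

def counterfactual(sentence: str) -> str:
    """Return the sentence with each gendered term replaced by its pair."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAP_PAIRS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(SWAP_PAIRS) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

corpus = ["The doctor said he would call the nurse."]
augmented = corpus + [counterfactual(s) for s in corpus]
# `augmented` now pairs each sentence with its counterfactual, rebalancing
# demographic-attribute co-occurrence counts in the training data.
```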
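For the projection-based debiasing item, a minimal sketch under the assumption that we have access to the model's embedding vectors (random stand-ins here): estimate a one-dimensional bias direction from paired term differences, then subtract its component from each representation.

```python
import numpy as np

# Projection-based debiasing sketch: estimate a "gender direction" from
# embedding differences between paired terms, then remove its component
# from every embedding. The embeddings below are random stand-ins; in
# practice they would come from the model under audit.
rng = np.random.default_rng(0)
dim = 16
emb = {w: rng.normal(size=dim)
       for w in ["he", "she", "man", "woman", "engineer"]}

# Bias direction: mean of difference vectors between paired terms.
pairs = [("he", "she"), ("man", "woman")]
diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
g = diffs.mean(axis=0)
g /= np.linalg.norm(g)

def debias(v: np.ndarray) -> np.ndarray:
    """Remove the component of v lying along the bias direction g."""
    return v - np.dot(v, g) * g

neutral = debias(emb["engineer"])
assert abs(np.dot(neutral, g)) < 1e-9  # no residual bias component
```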
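For the continuous-auditing item, a StereoSet-style stereotype score can be tracked over time: the fraction of sentence pairs where the model prefers the stereotypical variant, which should sit near 0.5 for an unbiased model. The sketch below is schematic; stereotype_score, toy_logprob, and the example pairs are hypothetical names, and a real audit would score full benchmark suites with the deployed model's log-likelihoods.

```python
from typing import Callable

def stereotype_score(
    pairs: list[tuple[str, str]],
    logprob: Callable[[str], float],
) -> float:
    """Fraction of pairs where the stereotypical variant scores higher
    under `logprob`; an unbiased model should land near 0.5."""
    prefer = sum(logprob(stereo) > logprob(anti) for stereo, anti in pairs)
    return prefer / len(pairs)

# Toy stand-in for the deployed model's scorer, used only so the sketch
# runs; in production this would call the model's log-likelihood endpoint.
def toy_logprob(sentence: str) -> float:
    return -float(len(sentence))

audit_pairs = [
    ("The nurse said she was tired.", "The nurse said he was tired."),
    ("The engineer fixed his code.", "The engineer fixed her code."),
]
print(stereotype_score(audit_pairs, toy_logprob))
# Run on a schedule post-deployment; alert if the score drifts from 0.5.
```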
ADDITIONAL EVIDENCE
Stereotypes reflect general expectations, often misleading, about members of particular social groups. They are typically seen as hostile prejudice and as a basis for discrimination against out-group members.