1. Discrimination & Toxicity (Pre-deployment)

Biased Training Data

Compared with toxicity, bias is more subjective and context-dependent to define. Following previous work [97], [101], we describe bias as disparities that reflect demographic differences among groups, involving both demographic word prevalence and stereotypical content. Concretely, in massive corpora, the prevalence of different pronouns and identity terms can shape an LLM's tendencies regarding gender, nationality, race, religion, and culture [4]. For instance, the pronoun He is over-represented relative to She in training corpora, so LLMs learn less context about She and generate He with higher probability [4], [102]. Furthermore, stereotypical bias [103], which refers to overgeneralized beliefs about a particular group of people, usually encodes incorrect values and hides within large-scale, otherwise benign content. In practice, defining what should be regarded as a stereotype in a corpus remains an open problem.
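As a rough illustration of the He/She imbalance described above, demographic word prevalence can be quantified with simple token counts. This is a minimal sketch; the pronoun list and toy corpus are illustrative, not drawn from the repository or the cited works.

```python
import re
from collections import Counter

def pronoun_counts(corpus):
    """Count gendered pronoun occurrences across a list of documents."""
    pronouns = {"he", "him", "his", "she", "her", "hers"}
    counts = Counter()
    for doc in corpus:
        # Lowercase and tokenize on word characters; count only pronouns.
        for token in re.findall(r"[a-z']+", doc.lower()):
            if token in pronouns:
                counts[token] += 1
    return counts

# Toy corpus: masculine pronouns dominate, mirroring the imbalance
# a model would absorb from real-world training data at scale.
corpus = [
    "He finished his report before she arrived.",
    "He said he would call him later.",
]
print(pronoun_counts(corpus))
```

On a real corpus, comparing the masculine and feminine totals (or their ratio) gives a first-pass estimate of the representation gap that downstream balancing or augmentation would need to correct.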

Source: MIT AI Risk Repository, risk ID mit37

- **Entity:** 2 - AI
- **Intent:** 2 - Unintentional
- **Timing:** 1 - Pre-deployment
- **Risk ID:** mit37

**Domain lineage:** 1. Discrimination & Toxicity (156 mapped risks) > 1.1 Unfair discrimination and misrepresentation

Mitigation strategy

1. **Implement Rigorous Data Curation and Augmentation Techniques.** Systematically audit the training data to identify and quantify representation imbalances, stereotypical content, and historical biases. Prioritize preprocessing methods such as filtering overtly prejudiced examples, balancing underrepresented demographic groups (e.g., professions by gender, names by ethnicity), and applying Counterfactual Data Augmentation (CDA) to create synthetic, non-stereotypical examples for model fine-tuning.
2. **Integrate Fairness Constraints into the Model Training Process.** During model fine-tuning or continued pre-training, adopt in-processing debiasing methods. These techniques include incorporating fairness-aware loss functions to minimize the statistical correlation between protected attributes (e.g., race, gender) and model predictions, using adversarial debiasing, or applying Reinforcement Learning with Human Feedback (RLHF) to guide the model away from reinforcing stereotypes and toward equitable responses.
3. **Establish Continuous Bias Detection, Auditing, and Feedback Mechanisms.** Develop and deploy a robust governance framework for post-deployment monitoring. Conduct regular, proactive audits using standardized and context-specific bias benchmarks (e.g., CrowS-Pairs, BOLD) to quantify the degree of explicit and implicit bias. Implement user feedback channels and real-time output filters to promptly identify and address emergent biases, ensuring a continuous refinement cycle for maintaining fairness.
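The Counterfactual Data Augmentation step in the first mitigation can be sketched as a simple term-swapping pass over the corpus. This is a toy illustration under our own assumptions: the swap table is a hypothetical, deliberately small word map, and real CDA must handle ambiguous forms (e.g., "her" as both possessive and object pronoun) and named entities, which this sketch ignores.

```python
import re

# Hypothetical swap table for illustration only. Note the simplification:
# "his" and "him" both map to "her", but "her" maps back only to "his",
# so round-tripping is lossy for object pronouns.
SWAP = {
    "he": "she", "she": "he",
    "his": "her", "her": "his", "him": "her",
    "man": "woman", "woman": "man",
}

def counterfactual(sentence):
    """Produce a counterfactual variant by swapping gendered terms."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower(), word)
        # Preserve the original capitalization of sentence-initial words.
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b\w+\b", repl, sentence)

def augment(corpus):
    """Return the original examples plus their counterfactual pairs."""
    return corpus + [counterfactual(s) for s in corpus]
```

For example, `augment(["He is a man."])` yields both the original sentence and `"She is a woman."`, so fine-tuning sees the stereotyped association and its counterfactual with equal frequency.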