1. Discrimination & Toxicity

Bias

The training datasets of LLMs may contain biased information that leads LLMs to generate outputs with social biases.

Source: MIT AI Risk Repository (mit08)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit08

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.1 > Unfair discrimination and misrepresentation

Mitigation strategy

1. Prioritize data-level interventions by rigorously curating and augmenting training datasets to ensure demographic and cultural representativeness, actively filtering for explicit biases, and applying techniques such as Counterfactual Data Augmentation (CDA) to challenge and reduce stereotypical associations.

2. Implement model-level, fairness-aware algorithmic adjustments during fine-tuning, such as explicit fairness constraints, adversarial training, or contrastive training, to prevent the model from over-relying on spurious, biased features for prediction.

3. Establish intra-processing control mechanisms, specifically self-diagnosis or self-reflection approaches (e.g., Self-Bias Mitigation in the Loop), in which the LLM is prompted to autonomously assess its generated output for potential social biases and iteratively refine the response toward a fairer outcome.
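To make the first strategy concrete, the following is a minimal sketch of Counterfactual Data Augmentation: each training sentence containing a gendered term is paired with a counterfactual copy in which the term is swapped, so the model sees both variants equally often. The swap list, the `counterfactual` and `augment` function names, and the token-level swapping rule are all illustrative assumptions, not part of the repository entry; real CDA pipelines use larger term lists and handle ambiguous words (e.g., "her" as possessive vs. objective) more carefully.

```python
# Illustrative sketch of Counterfactual Data Augmentation (CDA).
# Swap list is deliberately small; production lists are much larger
# and must handle ambiguous forms ("her" -> "his"/"him") with POS tags.
SWAP_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "man": "woman", "woman": "man",
}

def counterfactual(sentence: str) -> str:
    """Return the sentence with gendered terms swapped, keeping
    capitalization and trailing punctuation intact."""
    out = []
    for tok in sentence.split():
        core = tok.rstrip(".,!?")          # separate trailing punctuation
        punct = tok[len(core):]
        swapped = SWAP_PAIRS.get(core.lower())
        if swapped is None:
            out.append(tok)                # non-gendered token: keep as-is
        else:
            if core and core[0].isupper():
                swapped = swapped.capitalize()
            out.append(swapped + punct)
    return " ".join(out)

def augment(corpus: list[str]) -> list[str]:
    """Return the original sentences plus their counterfactual copies,
    balancing gendered associations in the training data."""
    return corpus + [counterfactual(s) for s in corpus]
```

Fine-tuning on the augmented corpus (rather than the original one) is what weakens stereotypical co-occurrences such as occupation-gender associations.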