Harms of Representation and Other Biases
A pretrained LLM generally exhibits many of the stereotypical biases common in human society (Touvron et al., 2023). This makes it difficult for users to trust that LLMs will work well for them and not produce unfair or biased responses. Appropriate finetuning can effectively limit the bias displayed in LLM outputs in a variety of situations, e.g. when models are explicitly prompted with stereotypes (Wang et al., 2023k), but it does not ‘solve’ the problem. Even after finetuning, biases often resurface when deliberately elicited (Wang et al., 2023k) or in novel scenarios, e.g. writing reference letters (Wan et al., 2023a), generating synthetic training data (Yu et al., 2023c), screening resumes (Yin et al., 2024), or acting as LLM agents (Pan et al., 2024).
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit1495
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Prioritize data preprocessing and curation: Implement rigorous data auditing to define metrics for source, group, and viewpoint coverage, followed by techniques such as Counterfactual Data Augmentation (CDA) to generate non-stereotypical examples and balance representation across sensitive attributes in the training corpus.
2. Employ fairness-aware model fine-tuning: Integrate bias mitigation during the learning phase using in-processing techniques such as Reinforcement Learning with Human Feedback (RLHF), adversarial debiasing, or architectural modifications such as Cross-Attention-based Weight Decay (CrAWD) to penalize the reinforcement of stereotypical associations.
3. Establish real-time monitoring and user feedback loops: Utilize post-deployment strategies, including content calibration or real-time semantic filtering, to detect and block biased outputs. Complement these with user feedback mechanisms and regular audits against comprehensive bias benchmarks to enable continuous, iterative refinement.
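The Counterfactual Data Augmentation step above can be sketched in a few lines. The swap list below is a deliberately tiny, illustrative lexicon for one binary gender attribute (an assumption for the sketch, not part of the source); a production pipeline would use a curated lexicon and handle morphology, names, and pronoun ambiguity.

```python
# Minimal sketch of Counterfactual Data Augmentation (CDA): for each
# training sentence, swap terms tied to a sensitive attribute to produce
# a counterfactual copy, balancing representation in the corpus.
import re

# Hypothetical, deliberately small swap list. Real pipelines use curated
# lexicons and resolve ambiguous forms (e.g. possessive vs. object "her").
SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "hers", "hers": "his",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}

def counterfactual(sentence: str) -> str:
    """Return a copy of `sentence` with attribute terms swapped."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAPS[word.lower()]
        # Preserve capitalization of the original token.
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

def augment(corpus: list[str]) -> list[str]:
    """Pair each sentence with its counterfactual copy."""
    return [s for sent in corpus for s in (sent, counterfactual(sent))]
```

For example, `augment(["He is a doctor"])` yields both the original sentence and `"She is a doctor"`, so the stereotype-bearing association appears equally often with each attribute value during training.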