Fairness
Avoiding bias and ensuring no disparate performance
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit487
Domain lineage
1. Discrimination & Toxicity
1.3 > Unequal performance across groups
Mitigation strategy
1. Implement rigorous data curation and preprocessing to mitigate representational and selection bias in the training corpora. This involves systematically filtering overtly biased content, balancing the inclusion of historically underrepresented demographic groups through targeted data augmentation, and selectively anonymizing sensitive attributes so the model does not learn spurious correlations.
2. Apply in-training and intra-processing debiasing strategies, such as incorporating fairness constraints directly into the loss function or employing adversarial debiasing to minimize correlations between predictions and protected attributes. In addition, use Reinforcement Learning from Human Feedback (RLHF) specifically focused on rewarding equitable outcomes across sensitive groups.
3. Establish continuous post-deployment monitoring and auditing using standardized benchmarks (e.g., StereoSet, CrowS-Pairs) and context-specific evaluation metrics designed to quantify and detect unequal performance across demographic subgroups in real-world applications. This requires implementing real-time output filters and transparent user-feedback loops to enable rapid detection and iterative correction of emergent bias.
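The subgroup auditing in step 3 can be sketched as a per-group accuracy comparison. This is a minimal illustrative example, not a prescribed implementation: the record layout, group labels, and the accuracy-gap metric are assumptions chosen for clarity (a production audit would use the benchmark suites and context-specific metrics named above).

```python
# Minimal sketch of a post-deployment subgroup audit: compute accuracy
# per demographic group and report the largest gap between groups.
# Record format (group, prediction, label) is hypothetical.
from collections import defaultdict

def subgroup_accuracy_gap(records):
    """Return per-group accuracy and the max gap between any two groups."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, pred, label in records:
        totals[group] += 1
        hits[group] += int(pred == label)
    accuracy = {g: hits[g] / totals[g] for g in totals}
    gap = max(accuracy.values()) - min(accuracy.values())
    return accuracy, gap

# Toy evaluation data: group "A" is predicted correctly 3/4 times,
# group "B" only 2/4 times, so the audit flags a 0.25 gap.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 1, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 0), ("B", 0, 1),
]
accuracy, gap = subgroup_accuracy_gap(records)
```

A monitoring loop would run such a check on rolling windows of logged traffic and alert when the gap exceeds a context-specific threshold.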
ADDITIONAL EVIDENCE
Fairness is vital because biased LLMs that are not aligned with widely shared human values can discriminate against users, eroding user trust, generating negative public opinion about deployers, and violating anti-discrimination laws.