Fairness
Avoiding bias and ensuring no disparate performance
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit487
Domain lineage
1. Discrimination & Toxicity
1.3 > Unequal performance across groups
Mitigation strategy
1. Implement rigorous data curation and preprocessing to mitigate representational and selection bias in the training corpora. This involves systematically filtering overtly biased content, balancing the inclusion of historically underrepresented demographic groups through targeted data augmentation, and selectively anonymizing sensitive attributes so the model does not learn spurious correlations.
2. Apply in-training and intra-processing debiasing strategies, such as incorporating fairness constraints directly into the loss function or employing adversarial debiasing to minimize correlations between predictions and protected attributes. In addition, use Reinforcement Learning from Human Feedback (RLHF) specifically focused on rewarding equitable outcomes across sensitive groups.
3. Establish continuous post-deployment monitoring and auditing using standardized benchmarks (e.g., StereoSet, CrowS-Pairs) and context-specific evaluation metrics designed to quantify and detect unequal performance across demographic subgroups in real-world applications. This requires implementing real-time output filters and transparent user-feedback loops to enable rapid detection and iterative correction of emergent bias.
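The subgroup auditing in step 3 can be sketched as a per-group accuracy comparison. This is a minimal illustrative example, not a prescribed implementation: the record layout, group labels, and the accuracy-gap metric are assumptions chosen for clarity (a production audit would use the benchmark suites and context-specific metrics named above).

```python
# Minimal sketch of a post-deployment subgroup audit: compute accuracy
# per demographic group and report the largest gap between groups.
# Record format (group, prediction, label) is hypothetical.
from collections import defaultdict

def subgroup_accuracy_gap(records):
    """Return per-group accuracy and the max gap between any two groups."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, pred, label in records:
        totals[group] += 1
        hits[group] += int(pred == label)
    accuracy = {g: hits[g] / totals[g] for g in totals}
    gap = max(accuracy.values()) - min(accuracy.values())
    return accuracy, gap

# Toy evaluation data: group "A" is predicted correctly 3/4 times,
# group "B" only 2/4 times, so the audit flags a 0.25 gap.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 1, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 0), ("B", 0, 1),
]
accuracy, gap = subgroup_accuracy_gap(records)
```

A monitoring loop would run such a check on rolling windows of logged traffic and alert when the gap exceeds a context-specific threshold.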
ADDITIONAL EVIDENCE
Fairness is vital because biased LLMs that are not aligned with widely shared human values can discriminate against users, eroding user trust, generating negative public opinion about deployers, and violating anti-discrimination laws.