Bias
Seven types of bias are evaluated:
(1) Demographic representation: These evaluations assess whether there is disparity in the rates at which different demographic groups are mentioned in LLM-generated text, ascertaining over-representation, under-representation, or erasure of specific demographic groups.
(2) Stereotype bias: These evaluations assess whether there is disparity in the rates at which different demographic groups are associated with stereotyped terms (e.g., occupations) in an LLM's generated output.
(3) Fairness: These evaluations assess whether sensitive attributes (e.g., sex and race) affect the predictions of LLMs.
(4) Distributional bias: These evaluations assess the variance in offensive content in an LLM's generated output for a given demographic group, compared to other groups.
(5) Representation of subjective opinions: These evaluations assess whether LLMs equitably represent diverse global perspectives on societal issues (e.g., whether employers should give job priority to citizens over immigrants).
(6) Political bias: These evaluations assess whether LLMs display any slant or preference towards certain political ideologies or views.
(7) Capability fairness: These evaluations assess whether an LLM's performance on a task is unjustifiably different across different groups and attributes (e.g., whether an LLM's accuracy degrades across different English varieties).
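As a minimal sketch of how a demographic-representation evaluation might be scored, the snippet below counts the fraction of generated texts that mention each group and exposes the disparity between them. The group lexicon and term lists are illustrative assumptions, not part of any specific benchmark:

```python
from collections import Counter

# Hypothetical, simplified lexicon mapping groups to mention terms
# (real evaluations use far richer term lists and matching rules).
GROUP_TERMS = {
    "women": ["she", "woman", "her"],
    "men": ["he", "man", "his"],
}

def mention_rates(texts):
    """Fraction of texts that mention each demographic group at least once."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for group, terms in GROUP_TERMS.items():
            if any(t in tokens for t in terms):
                counts[group] += 1
    total = len(texts)
    return {group: counts[group] / total for group in GROUP_TERMS}

def representation_disparity(texts):
    """Gap between the most- and least-mentioned groups (0 = parity)."""
    rates = mention_rates(texts)
    return max(rates.values()) - min(rates.values())
```

A large disparity value would flag over- or under-representation of a group; the same counting scheme extends to stereotype bias by swapping in stereotyped-term lists (e.g., occupations) per group.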
ENTITY
2 - AI
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit648
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Establish rigorous data governance protocols to implement proactive pre-processing bias mitigation. This mandates the expansion and balancing of training datasets to ensure diverse cultural and demographic representation, coupled with systematic filtering for explicit bias and harmful content, thereby addressing the foundational source of bias propagation.
2. Employ advanced in-training alignment methodologies to embed fairness objectives directly into the model's learned preferences. This involves techniques such as Direct Preference Optimization (DPO), with a loss function engineered to favor unbiased completions, or reinforcement learning from human feedback (RLHF), to align model behavior with predefined ethical and fairness principles across sensitive social categories.
3. Integrate multi-faceted, real-time and post-generation mitigation mechanisms during inference. This includes self-reflection and self-diagnosis architectures (e.g., Self-Bias Mitigation in the Loop) that enable the LLM to autonomously critique and adjust its own outputs for bias, together with sophisticated prompt engineering strategies that steer the model towards more equitable and context-aware responses.
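The DPO objective referenced in strategy 2 can be sketched for a single preference pair, where the "chosen" completion is the unbiased one and the "rejected" completion is the biased one. This is a scalar, framework-free illustration of the standard DPO loss under assumed log-probabilities; a real implementation would operate on batched model outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are the policy's log-probabilities of the unbiased (chosen)
    and biased (rejected) completions; ref_logp_* are the frozen
    reference model's. Minimizing this loss raises the policy's margin
    for the unbiased completion relative to the reference model.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -math.log(sigmoid(chosen_reward - rejected_reward))
```

When the policy matches the reference model the margin is zero and the loss is -log(0.5); as the policy assigns relatively more probability to the unbiased completion, the loss falls toward zero, which is what "a loss function engineered to favor unbiased completions" amounts to.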