1. Discrimination & Toxicity
2 - Post-deployment

Injustice

In the context of LLM outputs, we want to ensure that suggested or completed texts are indistinguishable in nature for two individuals named in the prompt who share the same relevant profiles but come from different groups, where the group attribute is regarded as irrelevant in this context.
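This indistinguishability requirement can be probed with counterfactual prompt pairs. The sketch below is illustrative, not part of the repository entry: it assumes a `model` callable mapping a prompt string to a completion, and it masks the swapped group term before comparing outputs.

```python
def counterfactual_pair(template: str, attr_a: str, attr_b: str) -> tuple[str, str]:
    """Instantiate the same prompt template with two different group attributes."""
    return template.format(group=attr_a), template.format(group=attr_b)

def outputs_indistinguishable(model, template: str, attr_a: str, attr_b: str) -> bool:
    """Check that the model's completions match once the swapped attribute is masked."""
    prompt_a, prompt_b = counterfactual_pair(template, attr_a, attr_b)
    out_a = model(prompt_a).replace(attr_a, "[GROUP]")
    out_b = model(prompt_b).replace(attr_b, "[GROUP]")
    return out_a == out_b
```

A model that treats the two otherwise-identical profiles differently fails this check, which directly operationalizes the definition above.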

Source: MIT AI Risk Repository (mit488)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit488

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.1 > Unfair discrimination and misrepresentation

Mitigation strategy

1. **Internal Bias Neutralization via Activation Editing** Implement internal bias mitigation techniques, such as Affine Concept Editing or Fairness Pruning, to identify and neutralize the specific vector directions within the LLM's activation space (e.g., MLP layers) that correlate with protected attributes. This intervention, applied at inference time or through model optimization, robustly decouples the model's decision-making from irrelevant demographic cues to ensure output indistinguishability.

2. **Self-Correction and Multi-Agent Debate Frameworks** Deploy a post-processing or inference-time framework that mandates critical self-reflection (Self-BMIL) or cooperative debate (Coop-BMIL) on the model's initial output. The model, or a set of agents, must autonomously assess the response for impartiality, specifically checking whether the output relies on group-based stereotypes rather than relevant profile information, and then adjust the final generation to eliminate the detected bias.

3. **Stringent Counterfactual Data Augmentation and Filtering** Enforce a comprehensive data-level strategy by applying rigorous filtering and counterfactual data augmentation to the training and fine-tuning corpora. This involves systematically generating or modifying data samples to disrupt and balance stereotypical associations, ensuring that the model's foundational knowledge is not biased toward specific demographic groups when predicting outcomes for similar, relevant profiles.
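The core operation behind activation-editing approaches like those in strategy 1 can be reduced to a projection. This is a minimal sketch, not the cited methods themselves: it assumes a bias direction `v` has already been identified (e.g., by a linear probe over protected-attribute labels) and removes the component of an activation along that direction.

```python
import numpy as np

def neutralize_direction(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project activation h onto the orthogonal complement of the bias
    direction v, so h carries no component along v afterwards."""
    v_hat = v / np.linalg.norm(v)
    return h - np.dot(h, v_hat) * v_hat
```

After the projection, the edited activation is exactly orthogonal to `v`, so any downstream linear readout of that direction sees zero signal.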

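The counterfactual data augmentation in strategy 3 can be sketched as a simple attribute-swap over training sentences. The `SWAPS` lexicon here is a deliberately tiny, hypothetical example; a real pipeline would use a curated, domain-specific attribute lexicon and add both variants to the corpus.

```python
import re

# Hypothetical, minimal swap lexicon for illustration only.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def counterfactual_augment(sentence: str) -> str:
    """Produce a counterfactual copy of a training sentence by swapping
    group-attribute terms, preserving capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAPS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl

    pattern = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)
    return pattern.sub(swap, sentence)
```

Training on both the original and the augmented sentence balances the stereotypical association the original alone would reinforce.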
ADDITIONAL EVIDENCE

One of the prominent considerations of justice is impartiality [226]. Impartiality refers to the requirement that "similar individuals should be treated similarly" by the model. It resembles the individual-fairness concept in the machine-learning fairness literature.
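Individual fairness is commonly formalized as a Lipschitz condition on the model's scores. The sketch below is a generic illustration of that condition, not a method from the cited work; `score_fn` and `dist_fn` are hypothetical stand-ins for a model's scoring function and a task-appropriate similarity metric over profiles.

```python
def satisfies_impartiality(score_fn, profile_a, profile_b, dist_fn, L=1.0):
    """Individual fairness as a Lipschitz condition: the gap between the
    model's scores for two profiles must not exceed L times the distance
    between those profiles ("similar individuals treated similarly")."""
    gap = abs(score_fn(profile_a) - score_fn(profile_b))
    return gap <= L * dist_fn(profile_a, profile_b)
```

Two profiles that are identical on all relevant features have distance zero, so the condition forces their scores to be identical, which is exactly the group-attribute indistinguishability the risk description asks for.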