Biases are not accurately reflected in explanations
Existing explainability techniques can be insufficient for detecting discriminatory biases. Manipulation methods can hide underlying biases from these techniques, generating misleading explanations [192, 112]. Such explanations exclude sensitive or protected attributes, such as race or gender, and instead include desired attributes, even though they do not accurately represent the underlying model.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit1133
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Implement rigorous and systematic fairness testing using statistical metrics (e.g., equalized odds, demographic parity) across all relevant demographic and protected subgroups to establish the model's ground-truth accuracy and parity, thereby reducing reliance on potentially unfaithful post-hoc explanations for primary bias detection.
2. Employ adversarial and counterfactual testing to actively challenge the explanation mechanism's fidelity and stability, specifically crafting inputs to elicit and expose hidden or 'cherry-picked' sensitive attributes (e.g., race, gender) in the model's decision-making process, ensuring explanations truthfully reflect the model's internal logic.
3. Institute a mandatory Human-in-the-Loop (HITL) review process, staffed by a diverse panel of domain experts and ethicists, with the explicit mandate to scrutinize high-stakes decisions, question misleading explanations, and override outcomes suspected of being influenced by obscured bias, establishing a critical accountability layer.
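The fairness metrics named in strategy 1 can be computed directly from model outputs, without trusting any explanation mechanism. The following is a minimal illustrative sketch, assuming binary labels, binary predictions, and a binary sensitive attribute; the function names are hypothetical, not from any specific library.

```python
def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups.
    A gap near 0 indicates demographic parity for this prediction set."""
    def positive_rate(g):
        members = [p for p, a in zip(y_pred, group) if a == g]
        return sum(members) / len(members)
    return abs(positive_rate(0) - positive_rate(1))


def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group difference in true-positive or false-positive
    rate. Equalized odds requires both rates to match across groups."""
    def tpr_fpr(g):
        tp = fp = pos = neg = 0
        for t, p, a in zip(y_true, y_pred, group):
            if a != g:
                continue
            if t == 1:
                pos += 1
                tp += p
            else:
                neg += 1
                fp += p
        return tp / pos, fp / neg
    tpr0, fpr0 = tpr_fpr(0)
    tpr1, fpr1 = tpr_fpr(1)
    return max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))


# Illustrative data: both groups receive positive predictions at the same
# rate (parity holds), but error rates differ, so equalized odds is violated.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

print(demographic_parity_gap(y_pred, group))          # → 0.0
print(equalized_odds_gap(y_true, y_pred, group))      # → 0.5
```

The example also illustrates why multiple metrics are needed: demographic parity can hold while equalized odds is badly violated, which is exactly the kind of disparity a manipulated explanation could hide. Production systems would typically use a maintained library (e.g., Fairlearn) rather than hand-rolled metrics.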