Biases are not accurately reflected in explanations
Existing explainability techniques can be insufficient for detecting discriminatory biases. Manipulation methods can hide underlying biases from these techniques, generating misleading explanations [192, 112]. Such explanations exclude sensitive or protected attributes, such as race or gender, and instead include desired attributes, even though they do not accurately represent the underlying model.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit1133
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Implement rigorous and systematic fairness testing using statistical metrics (e.g., equalized odds, demographic parity) across all relevant demographic and protected subgroups to establish the model's ground-truth accuracy and parity, thereby reducing reliance on potentially unfaithful post-hoc explanations for primary bias detection.
2. Employ adversarial and counterfactual testing to actively challenge the explanation mechanism's fidelity and stability, specifically crafting inputs to elicit and expose hidden or 'cherry-picked' sensitive attributes (e.g., race, gender) in the model's decision-making process, ensuring explanations truthfully reflect the model's internal logic.
3. Institute a mandatory Human-in-the-Loop (HITL) review process, staffed by a diverse panel of domain experts and ethicists, with the explicit mandate to scrutinize high-stakes decisions, question misleading explanations, and override outcomes suspected of being influenced by obscured bias, establishing a critical accountability layer.
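The fairness metrics named in strategy 1 can be computed directly from model outputs, without trusting any explanation mechanism. The following is a minimal illustrative sketch, assuming binary labels, binary predictions, and a binary sensitive attribute; the function names are hypothetical, not from any specific library.

```python
def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups.
    A gap near 0 indicates demographic parity for this prediction set."""
    def positive_rate(g):
        members = [p for p, a in zip(y_pred, group) if a == g]
        return sum(members) / len(members)
    return abs(positive_rate(0) - positive_rate(1))


def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group difference in true-positive or false-positive
    rate. Equalized odds requires both rates to match across groups."""
    def tpr_fpr(g):
        tp = fp = pos = neg = 0
        for t, p, a in zip(y_true, y_pred, group):
            if a != g:
                continue
            if t == 1:
                pos += 1
                tp += p
            else:
                neg += 1
                fp += p
        return tp / pos, fp / neg
    tpr0, fpr0 = tpr_fpr(0)
    tpr1, fpr1 = tpr_fpr(1)
    return max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))


# Illustrative data: both groups receive positive predictions at the same
# rate (parity holds), but error rates differ, so equalized odds is violated.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

print(demographic_parity_gap(y_pred, group))          # → 0.0
print(equalized_odds_gap(y_true, y_pred, group))      # → 0.5
```

The example also illustrates why multiple metrics are needed: demographic parity can hold while equalized odds is badly violated, which is exactly the kind of disparity a manipulated explanation could hide. Production systems would typically use a maintained library (e.g., Fairlearn) rather than hand-rolled metrics.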