Data bias
Specifically, data bias refers to situations where certain groups or types of elements are over-weighted or over-represented relative to others in AI/ML models, or where variables that are crucial to characterizing a phenomenon of interest are not properly captured by the learned models.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit333
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Prioritize the collection of diverse and representative data: Ensure the training corpus encompasses a broad spectrum of demographic, contextual, and environmental conditions to accurately reflect the target real-world population and mitigate selection bias at the data source level.
2. Implement data pre-processing techniques for imbalance correction: Apply methods such as reweighting (assigning differential importance to samples) or explicit resampling (e.g., oversampling minority classes, generating synthetic data) to achieve distributional balance and reduce the overrepresentation of specific groups within the training dataset.
3. Conduct systematic analysis and modification of proxy variables: Identify and assess features that exhibit high statistical correlation with protected attributes, and subsequently remove, reweight, or transform these proxy variables to prevent the algorithm from indirectly utilizing them to perpetuate systemic discrimination.
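The imbalance-correction techniques in step 2 can be sketched in a few lines. The snippet below is a minimal illustration, not a prescribed implementation: the labels `y` are hypothetical, and the two approaches shown are inverse-frequency reweighting (each class contributes equal total weight) and random oversampling of the minority class.

```python
import numpy as np

# Hypothetical imbalanced labels: class 1 is heavily under-represented.
rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)

# --- Reweighting: inverse-frequency sample weights ---
# Each class ends up with the same total weight, so a weighted loss
# no longer favors the majority class.
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weights = np.array([class_weight[label] for label in y])

# --- Random oversampling: duplicate minority-class samples ---
# Draw minority indices with replacement until the classes are balanced.
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
y_balanced = np.concatenate([y, y[extra]])
```

In practice, `sample_weights` would be passed to a weighted loss (many training APIs accept per-sample weights), while oversampling would be applied jointly to features and labels before training; synthetic-data methods such as SMOTE are a common refinement of the naive duplication shown here.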