Data bias
Specifically, data bias refers to situations where certain groups or types of elements are over-weighted or over-represented relative to others in AI/ML models, or where variables that are crucial to characterizing a phenomenon of interest are not properly captured by the learned models.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit333
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Prioritize the collection of diverse and representative data: Ensure the training corpus encompasses a broad spectrum of demographic, contextual, and environmental conditions to accurately reflect the target real-world population and mitigate selection bias at the data source level.
2. Implement data pre-processing techniques for imbalance correction: Apply methods such as reweighting (assigning differential importance to samples) or explicit resampling (e.g., oversampling minority classes, generating synthetic data) to achieve distributional balance and reduce the overrepresentation of specific groups within the training dataset.
3. Conduct systematic analysis and modification of proxy variables: Identify and assess features that exhibit high statistical correlation with protected attributes, and subsequently remove, reweight, or transform these proxy variables to prevent the algorithm from indirectly utilizing them to perpetuate systemic discrimination.
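The imbalance-correction techniques in step 2 can be sketched in a few lines. The snippet below is a minimal illustration, not a prescribed implementation: the labels `y` are hypothetical, and the two approaches shown are inverse-frequency reweighting (each class contributes equal total weight) and random oversampling of the minority class.

```python
import numpy as np

# Hypothetical imbalanced labels: class 1 is heavily under-represented.
rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)

# --- Reweighting: inverse-frequency sample weights ---
# Each class ends up with the same total weight, so a weighted loss
# no longer favors the majority class.
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weights = np.array([class_weight[label] for label in y])

# --- Random oversampling: duplicate minority-class samples ---
# Draw minority indices with replacement until the classes are balanced.
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
y_balanced = np.concatenate([y, y[extra]])
```

In practice, `sample_weights` would be passed to a weighted loss (many training APIs accept per-sample weights), while oversampling would be applied jointly to features and labels before training; synthetic-data methods such as SMOTE are a common refinement of the naive duplication shown here.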