7. AI System Safety, Failures, & Limitations

Unrepresentative data

Unrepresentative data occurs when the training or fine-tuning data is not sufficiently representative of the underlying population or does not measure the phenomenon of interest.

Source: MIT AI Risk Repository (mit1282)

ENTITY

3 - Other

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit1282

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. Prioritize Representative Data Sourcing and Collection
Implement robust strategies to ensure training and fine-tuning data is sufficiently representative of the target population and real-world variability. This requires actively diversifying data sources, ensuring comprehensive demographic and socio-economic representation, and establishing clear data collection standards to proactively prevent selection bias.

2. Establish Rigorous Bias Detection and Auditing Frameworks
Integrate continuous monitoring, statistical testing (e.g., disparate impact analysis and fairness metrics), and regular dataset audits throughout the AI lifecycle. Employ specialized AI fairness toolkits to systematically identify, measure, and track the presence of systemic and historical biases within the data and model outputs across different subgroups.

3. Employ Technical Data Balancing and Remediation Techniques
Utilize methods such as stratified sampling, reweighting data points, or appropriate oversampling and undersampling to correct for skewed distributions and address the underrepresentation of specific attributes or demographics within the training corpus, thereby enhancing the model's fairness and generalizability.
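Disparate impact analysis, mentioned under strategy 2, can be sketched as a ratio of positive-outcome rates across subgroups. The example below is a minimal illustration with hypothetical data (the group labels, outcomes, and the `disparate_impact_ratio` helper are all assumptions, not part of the repository entry); the 0.8 threshold reflects the commonly cited "four-fifths rule".

```python
from collections import defaultdict

def disparate_impact_ratio(groups, outcomes):
    """Return (ratio, per-group rates), where ratio is the lowest
    positive-outcome rate divided by the highest across subgroups.
    Values below 0.8 are commonly flagged for review."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, y in zip(groups, outcomes):
        totals[g] += 1
        positives[g] += y
    rates = {g: positives[g] / totals[g] for g in totals}
    return min(rates.values()) / max(rates.values()), rates

# Hypothetical screening outcomes for two demographic groups:
# group A receives a positive outcome 80% of the time, group B 40%.
groups = ["A"] * 10 + ["B"] * 10
outcomes = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6

ratio, rates = disparate_impact_ratio(groups, outcomes)
if ratio < 0.8:
    print(f"Potential disparate impact: ratio = {ratio:.2f}, rates = {rates}")
```

In a real audit this check would be run per protected attribute and tracked over the AI lifecycle, as the strategy describes, rather than computed once on a toy sample.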
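Reweighting, mentioned under strategy 3, can be sketched as assigning each example a weight inversely proportional to the frequency of its attribute value, so that each subgroup contributes equal total weight to the training objective. The helper name and the urban/rural attribute below are illustrative assumptions, not from the source.

```python
from collections import Counter

def inverse_frequency_weights(attributes):
    """Weight each example by n / (n_groups * count(attribute)),
    so every attribute value carries equal total weight."""
    counts = Counter(attributes)
    n, n_groups = len(attributes), len(counts)
    return [n / (n_groups * counts[a]) for a in attributes]

# Hypothetical skewed corpus: 9 urban examples, 1 rural example.
attrs = ["urban"] * 9 + ["rural"] * 1
weights = inverse_frequency_weights(attrs)
# Each group now sums to half the total weight, correcting the skew.
```

These weights would typically be passed to a training routine's per-sample weight argument; oversampling or undersampling achieves a similar rebalancing by changing the data itself rather than the loss.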