7. AI System Safety, Failures, & Limitations

Data contamination

Data contamination occurs when inappropriate data is used for training: for example, data that is not aligned with the model's purpose, or data that has already been set aside for other development tasks such as testing and evaluation.

Source: MIT AI Risk Repository (mit1281)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit1281

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. Establish a comprehensive data governance framework for all training and evaluation datasets, prioritizing rigorous statistical validation, schema consistency checks, and real-time anomaly detection to preemptively identify and filter corrupted, mislabeled, or adversarial samples before model ingestion. 2. Enforce strict and verifiable segregation between training and evaluation datasets, utilizing techniques such as time-based or grouped data splits, hash-based filtering, and encryption protocols (e.g., public-key encryption with "No Derivatives" licensing) for evaluation data to prevent accidental or non-adversarial leakage. 3. Integrate adversarial training and other robustness-enhancing mechanisms (e.g., trimmed loss functions, data reweighting, and influence-based auditing) into the model training pipeline to mitigate the impact of stealthy data poisoning attacks on model integrity and accuracy.