Benchmarking (Raw data contamination)
This type of contamination [170] occurs when the raw, unlabeled data of a benchmark is used as part of the training set. Such data may be improperly formatted and noisy, especially if the contamination happens before the data is pre-processed into the benchmark. When this occurs, it casts doubt on the model's few-shot and zero-shot performance on that benchmark.
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1117
Domain lineage
6. Socioeconomic and Environmental
6.5 > Governance failure
Mitigation strategy
1. Implement Strict Data Governance and Supply Chain Management
Establish a rigorous data provenance framework in which all training corpora are versioned, signed, and covered by a comprehensive data bill of materials (DBOM). This foundational step ensures that evaluation benchmark data, or any derivative of it, is systematically excluded from the ingestion and preprocessing pipelines of the training set (Source 20, 9).

2. Proactive and Comprehensive Deduplication and Filtering
Before model pre-training, deploy automated data hygiene pipelines that identify and remove all forms of overlap (verbatim, partial, and approximate matches) between the web-scale corpus and all known evaluation benchmarks. Detection can leverage techniques such as n-gram matching with aggressive thresholds or kernel divergence scoring to ensure a clean training baseline (Source 7, 10, 11).

3. Utilize Temporal Isolation for Benchmark Creation
Construct new evaluation benchmarks from data sources and instances created or released *after* the pre-training cutoff date of the model being evaluated. This strategy leverages time as an objective criterion to create a naturally clean test set, significantly mitigating the risk of unintentional raw data contamination (Source 11, 18).
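The n-gram overlap detection described in mitigation 2 can be sketched as follows. This is a minimal illustration, not a specific tool's API: the function names and the 13-gram window size are assumptions chosen for the example (13-grams are a common choice in contamination studies, but the source does not prescribe one).

```python
def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_examples, n: int = 13) -> set:
    """Union of n-grams across all benchmark examples."""
    index = set()
    for example in benchmark_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(document: str, benchmark_index: set, n: int = 13) -> bool:
    """Flag a training document that shares any n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(benchmark_index)

def filter_corpus(corpus, benchmark_examples, n: int = 13):
    """Drop training documents that overlap the benchmark."""
    index = build_benchmark_index(benchmark_examples, n)
    return [doc for doc in corpus if not is_contaminated(doc, index, n)]
```

In practice the threshold would be tuned (e.g. requiring a minimum fraction of shared n-grams rather than any single match), and approximate matching would need fuzzier techniques such as MinHash, which this verbatim-only sketch does not cover.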
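The temporal isolation strategy in mitigation 3 reduces to a simple date filter over candidate benchmark items. A minimal sketch, assuming each candidate carries a `created` date field (an illustrative schema, not from the source):

```python
from datetime import date

def temporally_isolated(candidates: list[dict], cutoff: date) -> list[dict]:
    """Admit only items created strictly after the model's
    pre-training cutoff, so the model cannot have seen them."""
    return [c for c in candidates if c["created"] > cutoff]
```

The cutoff must be the model's actual data-collection cutoff, not its release date; data published between those two points may still have been ingested.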