Benchmarking (Cross-lingual data contamination)
Models that have been trained on data encoded in multiple languages, such as LLMs trained on web-crawled data, may contain contamination that is obscured by translation [226]. The most basic form of this is when a benchmark is translated into another language and then fed to the model as training data. Because the benchmark is translated before it becomes training data, the contamination can evade detection methods, giving false assurance that the model has generalized to the capabilities that the benchmark tests for.
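To make the evasion concrete, here is a minimal sketch (all strings and function names are illustrative) of a surface-level character n-gram overlap check, the kind of detector that catches verbatim benchmark copies but scores a translated copy near zero:

```python
def char_ngrams(text, n=5):
    """Lowercased character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(doc, benchmark_item, n=5):
    """Fraction of the benchmark item's n-grams that appear in the document."""
    bench = char_ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & char_ngrams(doc, n)) / len(bench)

benchmark = "What is the boiling point of water at sea level?"
verbatim = "Q: What is the boiling point of water at sea level? A: 100 C"
translated = "Quel est le point d'ebullition de l'eau au niveau de la mer ?"

print(ngram_overlap(verbatim, benchmark))    # high: verbatim copy is flagged
print(ngram_overlap(translated, benchmark))  # near zero: translation evades the check
```

The French translation carries the same test item, yet the lexical detector sees almost nothing, which is exactly the false assurance described above.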
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1118
Domain lineage
6. Socioeconomic and Environmental
6.5 > Governance failure
Mitigation strategy
1. Prioritize Generalization-Centric Evaluation Methodologies
Implement evaluation frameworks that rigorously test a model's true generalization capacity rather than its ability to recall memorized, non-original content. Employ dynamic benchmarking, in which test samples undergo semantic-preserving but deep structural transformations (e.g., extension into application or analysis tasks) that shift the cognitive demand and render shallow memorization ineffective. In addition, develop and use private, non-public benchmarks to prevent ingestion of test data, accidental or adversarial, into training datasets.

2. Institute Strict Temporal and Dynamic Benchmark Refreshment
Establish a policy of continuous evaluation-data creation and update, ensuring that test sets are derived from sources published after the temporal cutoff of the model's training corpus. This "latest-materials-only" approach inherently mitigates contamination risk, particularly for web-crawled data, and should be paired with proactive monitoring of linguistic and cross-lingual equivalence to catch translated or derivative content that may already exist in the training data.

3. Enhance Multi-Modal and Cross-Lingual Data Filtering for Pre-training Corpora
Develop and apply data curation pipelines capable of detecting and removing transformed or derivative forms of benchmark content, including translated versions, from the pre-training data. This requires moving beyond basic n-gram overlap checks to sophisticated semantic matching, perplexity-based anomaly detection, or generalization-based contamination detection on the training corpus itself, eliminating the contamination source before it is encoded by the model.
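The semantic matching called for in strategy 3 can be sketched as follows. This toy example maps words to language-neutral concept IDs via a small bilingual lexicon, standing in for a real multilingual sentence encoder; the lexicon, names, and threshold are all illustrative assumptions, not a production design:

```python
# Toy cross-lingual matcher: a tiny bilingual lexicon maps surface words to
# shared concept IDs, acting as a stand-in for multilingual sentence
# embeddings. All entries here are illustrative.
LEXICON = {
    "boiling": "BOIL", "ebullition": "BOIL",
    "point": "POINT",
    "water": "WATER", "eau": "WATER",
    "sea": "SEA", "mer": "SEA",
    "level": "LEVEL", "niveau": "LEVEL",
}

def concepts(text):
    """Language-neutral concept set for a text (toy proxy for an embedding)."""
    words = text.lower().replace("?", " ").replace("'", " ").split()
    return {LEXICON[w] for w in words if w in LEXICON}

def semantic_overlap(doc, benchmark_item):
    """Jaccard similarity of concept sets; high values flag likely contamination."""
    a, b = concepts(doc), concepts(benchmark_item)
    return len(a & b) / len(a | b) if a | b else 0.0

benchmark = "What is the boiling point of water at sea level?"
translated = "Quel est le point d'ebullition de l'eau au niveau de la mer ?"
unrelated = "La capitale de la France est Paris."

print(semantic_overlap(translated, benchmark))  # high despite near-zero lexical overlap
print(semantic_overlap(unrelated, benchmark))   # low
```

Where the lexical check in the earlier discussion scores the translation near zero, this concept-level comparison flags it, which is the advantage semantic matching offers over n-gram overlap for filtering pre-training corpora.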