7. AI System Safety, Failures, & Limitations

Inappropriate data splitting

In data-driven AI development, the annotated data set is commonly split into training, validation, and test sets. It is essential that the test set is never used for development, only for evaluation: training on the test set invalidates the testing strategy on which the system's quality assurance rests.
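The split described above can be sketched as follows, assuming scikit-learn is available; the concrete sizes and random seed are illustrative. The test set is carved out first and is not touched again during development.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve out the held-out test set (20% of the data), then split the
# remainder into training (60% overall) and validation (20% overall).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # → 30 10 10
```

From this point on, all model selection and hyperparameter tuning is done against `X_val`; `X_test` is consulted once, for the final evaluation.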

Source: MIT AI Risk Repository (mit1007)

ENTITY

1 - Human

INTENT

3 - Other

TIMING

1 - Pre-deployment

Risk ID

mit1007

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.0 > AI system safety, failures, & limitations

Mitigation strategy

1. Enforce strict procedural and technical separation of data subsets. Institute rigorous governance and access controls so that the designated holdout test set is used *exclusively* for final, unbiased model evaluation. This control is paramount to preventing test-set contamination, which fundamentally compromises the system's quality-assurance metrics and leads to an overestimation of real-world performance. The test set must be isolated from the training, validation, and hyperparameter-tuning processes.

2. Employ structurally appropriate data-splitting methodologies. Select a partitioning strategy, such as time-based splits for temporal data or grouped splits for inherent entity structures (e.g., users, sessions), that aligns precisely with the data's dependencies and the inference-time constraints. All data-transformation and feature-engineering steps must be fit *only* on the training set and subsequently applied to the test set, to avoid unintentionally bleeding information across partitions.

3. Institute automated split validation and auditing mechanisms. Implement automated validation tests and audit tools within the MLOps pipeline to continuously inspect for signs of data leakage. Key checks include validating that no overlapping records or identifiers exist between sets, monitoring data lineage, and conducting differential testing (such as ablation studies) to detect anomalous performance gains that may indicate hidden leakage patterns.
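Strategies 2 and 3 can be sketched together in a few lines, assuming scikit-learn is available; the synthetic data, variable names, and split ratio are illustrative assumptions, not part of the source. The sketch shows a grouped split that keeps every record for a given entity (here, a user) on one side of the partition, a feature transformation fit only on the training set, and an automated audit that fails if any group identifier leaks across the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
users = np.repeat(np.arange(20), 5)  # 20 hypothetical users, 5 records each
X = rng.normal(size=(100, 3))

# Grouped split: no user appears in both the training and the test partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=users))

# Fit the transformation on the training partition only, then apply it to the
# test partition -- never fit on the combined data.
scaler = StandardScaler().fit(X[train_idx])
X_test_scaled = scaler.transform(X[test_idx])

# Automated audit: assert that no group identifier crosses the split.
overlap = set(users[train_idx]) & set(users[test_idx])
assert not overlap, f"data leakage: users {overlap} appear in both sets"
print("split audit passed:", len(train_idx), "train /", len(test_idx), "test")
```

For temporal data, the analogous move is `sklearn.model_selection.TimeSeriesSplit`, which only ever evaluates on records later than those trained on. The overlap assertion is the kind of check that belongs in a CI pipeline so that a leaky split fails the build rather than silently inflating metrics.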