7. AI System Safety, Failures, & Limitations

Problems of synthetic data

When real data is scarce, simulating or generating data is a valid alternative. However, it is essential to ensure that the simulated data is sufficiently similar to real data, especially in the way the AI system perceives it; otherwise, generalization to operational data and reliable operational behavior cannot be guaranteed.

Source: MIT AI Risk Repository (mit1006)

ENTITY: 3 - Other

INTENT: 3 - Other

TIMING: 1 - Pre-deployment

Risk ID: mit1006

Domain lineage: 7. AI System Safety, Failures, & Limitations (375 mapped risks) > 7.3 Lack of capability or robustness

Mitigation strategy

1. Implement a multi-metric statistical fidelity and utility validation framework prior to deployment. This framework must quantitatively assess the similarity between synthetic and real-world operational data distributions using metrics such as the Kolmogorov-Smirnov test, Kullback-Leibler divergence, and correlation matrix preservation (e.g., Frobenius norm).

2. Mandate the "Train on Synthetic, Test on Real" (TSTR) methodology during model development. The model trained exclusively on synthetic data must demonstrate comparable predictive performance, typically with performance metrics (e.g., F1-score, accuracy) remaining within a minimal, pre-defined deviation from a model trained on real-world data when both are tested against a real, operational hold-out set.

3. Establish a continuous domain-expert review and refinement loop for data realism. This process requires collaboration with domain specialists (e.g., clinicians, engineers) to perform red-teaming and automated consistency checks, ensuring the synthetic data accurately reflects logical constraints, temporal validity, and critical, rare operational scenarios that statistical similarity checks may overlook.
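The correlation-matrix-preservation check in step 1 can be sketched as follows: compute the Pearson correlation matrix of the real and synthetic datasets and take the Frobenius norm of their difference. The toy two-column dataset and all names here are hypothetical; the point is that a generator which samples each column independently loses the cross-feature dependency and scores a large distance.

```python
import math
import random

def correlation_matrix(columns):
    """Pearson correlation matrix of a dataset given as a list of columns."""
    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)
    return [[corr(a, b) for b in columns] for a in columns]

def frobenius_distance(m1, m2):
    """Frobenius norm of the element-wise difference of two matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(m1, m2)
                         for a, b in zip(r1, r2)))

random.seed(1)
# Real data: the second column is strongly correlated with the first.
x1 = [random.gauss(0, 1) for _ in range(1000)]
real_cols = [x1, [v + random.gauss(0, 0.3) for v in x1]]

# A faithful generator reproduces the dependency...
s1 = [random.gauss(0, 1) for _ in range(1000)]
faithful_cols = [s1, [v + random.gauss(0, 0.3) for v in s1]]
# ...while a naive per-column sampler destroys it.
naive_cols = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(2)]

real_corr = correlation_matrix(real_cols)
print("faithful distance:", frobenius_distance(real_corr, correlation_matrix(faithful_cols)))
print("naive distance:   ", frobenius_distance(real_corr, correlation_matrix(naive_cols)))
```

A small distance indicates the synthetic data preserves the real data's dependency structure; in a real pipeline the acceptance bound would be set per domain.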
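The TSTR methodology in step 2 can be sketched with a deliberately minimal 1-D threshold classifier standing in for the real model: fit one copy on real data and one on synthetic data, evaluate both on a real hold-out set, and require the accuracy gap to stay inside a pre-defined band. The generators, `MAX_DEVIATION` band, and all function names are hypothetical.

```python
import random

def make_data(n, shift=0.0):
    """Toy labeled 1-D data: class 1 ~ N(1+shift, 1), class 0 ~ N(-1+shift, 1)."""
    data = []
    for _ in range(n):
        y = random.random() < 0.5
        x = random.gauss((1.0 if y else -1.0) + shift, 1.0)
        data.append((x, int(y)))
    return data

def fit_threshold(points):
    """Fit a threshold classifier: the midpoint between the class means."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, points):
    return sum((x > threshold) == (y == 1) for x, y in points) / len(points)

random.seed(42)
real_train = make_data(1000)
real_holdout = make_data(1000)          # real, operational hold-out set
synthetic_train = make_data(1000)       # faithful synthetic generator

acc_trtr = accuracy(fit_threshold(real_train), real_holdout)      # train real, test real
acc_tstr = accuracy(fit_threshold(synthetic_train), real_holdout) # train synthetic, test real

MAX_DEVIATION = 0.05  # hypothetical pre-defined deviation band
print("TRTR accuracy:", acc_trtr)
print("TSTR accuracy:", acc_tstr)
print("within band:  ", abs(acc_trtr - acc_tstr) <= MAX_DEVIATION)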