Problems of synthetic data
When real data is scarce, simulating or generating data is a valid alternative. However, it is essential to ensure that the synthetic data is sufficiently similar to real data, especially as the AI system perceives it. Otherwise, generalization to operational data and reliable operational behavior cannot be guaranteed.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit1006
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Implement a multi-metric statistical fidelity and utility validation framework prior to deployment. This framework must quantitatively assess the similarity between synthetic and real-world operational data distributions using metrics such as the Kolmogorov-Smirnov test, Kullback–Leibler divergence, and correlation matrix preservation (e.g., Frobenius norm).
2. Mandate the "Train on Synthetic, Test on Real" (TSTR) methodology during model development. A model trained exclusively on synthetic data must demonstrate comparable predictive performance, typically with performance metrics (e.g., F1-score, accuracy) remaining within a minimal, pre-defined deviation from a model trained on real-world data when both are tested against a real, operational hold-out set.
3. Establish a continuous domain-expert review and refinement loop for data realism. This process requires collaboration with domain specialists (e.g., clinicians, engineers) to perform red-teaming and automated consistency checks, ensuring the synthetic data accurately reflects logical constraints, temporal validity, and critical, rare operational scenarios that statistical similarity checks may overlook.
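The fidelity checks in step 1 can be sketched as follows. This is a minimal illustration, assuming two-dimensional numeric tabular data; the synthetic sample here is just a perturbed Gaussian stand-in, and the histogram-based KL estimator and bin count are illustrative choices, not prescribed by the mitigation.

```python
# Hypothetical sketch of step 1: multi-metric statistical fidelity checks
# between a real and a synthetic sample over the same feature space.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for real operational data and a generated synthetic set.
real = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=2000)
synth = rng.multivariate_normal([0.05, 0], [[1, 0.55], [0.55, 1]], size=2000)

# Per-feature two-sample Kolmogorov-Smirnov test: marginal similarity.
ks_pvalues = [stats.ks_2samp(real[:, j], synth[:, j]).pvalue
              for j in range(real.shape[1])]

# Per-feature KL divergence estimated from shared histogram bins.
def kl_divergence(p_samples, q_samples, bins=30):
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-9, q + 1e-9            # smooth to avoid log(0)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

kl_scores = [kl_divergence(real[:, j], synth[:, j])
             for j in range(real.shape[1])]

# Correlation-structure preservation: Frobenius norm of the gap between
# the real and synthetic correlation matrices (0 means identical structure).
corr_gap = np.linalg.norm(np.corrcoef(real.T) - np.corrcoef(synth.T),
                          ord="fro")

print(ks_pvalues, kl_scores, round(corr_gap, 3))
```

In practice each metric would be compared against a pre-defined acceptance threshold, and the synthetic set rejected or regenerated if any check fails.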
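The TSTR gate in step 2 can be illustrated like this. The classifier choice, the toy data generator, and the 0.05 deviation budget are all assumptions for the sketch; the mitigation only requires that the synthetic-trained model stay within some pre-defined deviation on a real hold-out set.

```python
# Hypothetical sketch of step 2 (Train on Synthetic, Test on Real):
# train twin classifiers on synthetic vs. real data and compare both
# on the same real operational hold-out set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def make_dataset(n, shift=0.0):
    # Toy generator; `shift` mimics a small synthetic-vs-real mismatch.
    X = rng.normal(shift, 1.0, size=(n, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X_real, y_real = make_dataset(3000)
X_syn, y_syn = make_dataset(3000, shift=0.05)  # stand-in for generated data

# Real operational hold-out set shared by both evaluations.
X_train_real, X_hold, y_train_real, y_hold = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0)

f1_tstr = f1_score(y_hold, RandomForestClassifier(random_state=0)
                   .fit(X_syn, y_syn).predict(X_hold))
f1_real = f1_score(y_hold, RandomForestClassifier(random_state=0)
                   .fit(X_train_real, y_train_real).predict(X_hold))

deviation = abs(f1_real - f1_tstr)
passes_gate = deviation <= 0.05   # illustrative deviation budget
print(f"TSTR F1={f1_tstr:.3f}  real F1={f1_real:.3f}  "
      f"deviation={deviation:.3f}  pass={passes_gate}")
```

A failing gate would send the synthetic data back to the generation and refinement loop of step 3 rather than into model training.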