Insufficient data representation
The distribution of the data used to train a model should match the distribution of the operational data, and the training set should contain sufficiently many samples. An important aspect of matching the two distributions is that data the AI system encounters only rarely in operation is also represented in the training data.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit1005
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Systematically augment the training dataset, employing techniques such as targeted collection of real-world samples or the generation of synthetic data (e.g., using SMOTE or Generative Adversarial Networks) to ensure statistical parity between the training and expected operational data distributions, especially for minority classes and rare events.
2. Implement fairness-aware algorithmic adjustments, such as modifying the model's loss function (e.g., MinDiff, Counterfactual Logit Pairing) or applying instance re-weighting, to directly penalize performance discrepancies or classification errors tied to sensitive attributes or under-represented data slices during the training phase.
3. Establish a robust, continuous monitoring system to track key performance indicators and feature distributions in the operational environment, enabling the early detection of data drift or model decay, which signifies emerging under-representation, thereby triggering scheduled or event-driven model re-training with the newly available, representative data.
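As a rough illustration of two of the ideas above, instance re-weighting and monitoring a feature distribution for drift, the following Python sketch uses only the standard library. The function names `inverse_frequency_weights` and `population_stability_index`, the bin count, and the 0.25 alert threshold are assumptions chosen for illustration; they are not part of this record, and a real system would likely rely on an established library and thresholds tuned to the application.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample, inversely proportional to class frequency.

    Uses the common "balanced" heuristic n / (k * count), so the weights
    average to 1 and rare classes contribute more to a weighted loss.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


def population_stability_index(train, live, n_bins=10, eps=1e-6):
    """Population Stability Index (PSI) for one numeric feature.

    Bins are equal-width over the training range; operational values outside
    that range are clamped into the first or last bin. Empty bins are floored
    at eps so the logarithm stays defined.
    """
    lo, hi = min(train), max(train)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    p, q = bin_shares(train), bin_shares(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical data: a rare class "b" and an operational feature that drifted.
labels = ["a"] * 8 + ["b"] * 2
weights = inverse_frequency_weights(labels)

train_feature = [i / 100 for i in range(100)]
live_feature = [0.5 + i / 200 for i in range(100)]
psi = population_stability_index(train_feature, live_feature)

# 0.25 is a frequently quoted rule-of-thumb threshold, used here as an assumption.
needs_retraining = psi > 0.25
```

The weights could be passed to any training routine that accepts per-sample weights, and a PSI check of this kind would typically run per feature on a schedule, with a breach triggering the re-training described in item 3.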