7. AI System Safety, Failures, & Limitations

Insufficient data representation

The distribution of the data used to train a model should match the distribution of the operational data, while containing sufficiently many samples. An important aspect of matching the training and operational distributions is that data the AI system only rarely encounters in operation must also be represented in the training data.
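A simple way to surface this risk is to compare class frequencies between the training set and a sample of operational data. The sketch below is illustrative only; the function name, labels, and threshold are assumptions, not part of the repository entry. It flags any class whose training-set share falls below a chosen fraction of its operational share, which catches exactly the "rare in training, present in operation" case described above.

```python
from collections import Counter

def underrepresented_classes(train_labels, operational_labels, ratio_threshold=0.5):
    """Flag classes whose share of the training data falls below
    ratio_threshold times their share of the operational data."""
    train_freq = Counter(train_labels)
    op_freq = Counter(operational_labels)
    n_train, n_op = len(train_labels), len(operational_labels)
    flagged = []
    for cls, op_count in op_freq.items():
        op_share = op_count / n_op
        train_share = train_freq.get(cls, 0) / n_train
        if train_share < ratio_threshold * op_share:
            flagged.append(cls)
    return sorted(flagged)

# Hypothetical data: "dog" is 5% of training data but 30% of operational data.
train = ["cat"] * 95 + ["dog"] * 5
ops = ["cat"] * 70 + ["dog"] * 30
print(underrepresented_classes(train, ops))  # → ['dog']
```

A check like this only works on labeled categorical slices; for continuous features, distribution-distance measures (e.g. a population stability index) are the analogous tool.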

Source: MIT AI Risk Repository (mit1005)

ENTITY

3 - Other

INTENT

3 - Other

TIMING

1 - Pre-deployment

Risk ID

mit1005

Domain lineage

7. AI System Safety, Failures, & Limitations (375 mapped risks)

7.3 > Lack of capability or robustness

Mitigation strategy

1. Systematically augment the training dataset, employing techniques such as targeted collection of real-world samples or the generation of synthetic data (e.g., using SMOTE or Generative Adversarial Networks), to ensure statistical parity between the training and expected operational data distributions, especially for minority classes and rare events.

2. Implement fairness-aware algorithmic adjustments, such as modifying the model's loss function (e.g., MinDiff, Counterfactual Logit Pairing) or applying instance re-weighting, to directly penalize performance discrepancies or classification errors tied to sensitive attributes or under-represented data slices during the training phase.

3. Establish a robust, continuous monitoring system that tracks key performance indicators and feature distributions in the operational environment, enabling early detection of data drift or model decay that signals emerging under-representation, and triggering scheduled or event-driven re-training with the newly available, representative data.
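The drift monitoring in strategy 3 is often implemented with a distribution-distance metric over binned feature values. A minimal sketch, assuming a population stability index (PSI) as that metric; the bin histograms and the commonly cited ~0.25 alert threshold are illustrative assumptions, not prescribed by the repository entry:

```python
import math

def psi(expected_probs, actual_probs, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Values above roughly 0.25 are commonly read as significant drift,
    which would trigger re-training under strategy 3."""
    total = 0.0
    for e, a in zip(expected_probs, actual_probs):
        e = max(e, eps)  # guard against log(0) on empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical histograms of one feature, binned identically.
baseline = [0.5, 0.3, 0.2]  # at training time
current = [0.2, 0.3, 0.5]   # observed in operation
print(round(psi(baseline, current), 3))  # → 0.55
```

In practice this check would run per feature on a schedule, with a PSI above the chosen threshold triggering the event-driven re-training described above.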