Incorrect data labels
Data labels are essential for any supervised learning algorithm since they define the target of the learning process. If label correctness cannot be guaranteed, the AI system is prevented from learning the ground truth and therefore the intended functionality.
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1003
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Implement a Multi-Layered Data Governance and Quality Assurance Framework
Establish comprehensive and unambiguous annotation ontologies, mandate inter-rater reliability analysis (consensus scoring) during the labeling process, and integrate continuous data validation by domain experts. This preventative measure ensures high-fidelity "ground truth" and minimizes the introduction of label noise at the source.
2. Employ Noise-Tolerant Training Methodologies
Utilize machine learning techniques inherently robust to label noise, such as generalized or noise-tolerant loss functions (e.g., Generalized Cross Entropy or Mean Absolute Error) and regularization strategies like label smoothing. These methods reduce the model's capacity to memorize incorrect labels, thereby enhancing generalization performance.
3. Integrate Automated Label Error Detection and Correction Systems
Deploy model-based techniques, such as Confident Learning or analysis of loss dynamics, to systematically identify training instances with a high likelihood of being mislabeled. These flagged samples must then be routed for targeted human expert review and subsequent iterative correction (label cleaning) to maintain dataset integrity throughout the model lifecycle.
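As an illustration of the noise-tolerant training methodologies in item 2, the following is a minimal NumPy sketch of the Generalized Cross Entropy loss and of label smoothing. The function names, the choice of q = 0.7, and the smoothing factor eps = 0.1 are illustrative assumptions, not prescribed by this mitigation entry:

```python
import numpy as np

def generalized_cross_entropy(probs, labels, q=0.7):
    """Generalized Cross Entropy: L_q = (1 - p_y^q) / q.

    As q -> 0 this approaches standard cross entropy; q = 1 gives
    Mean Absolute Error, which is more tolerant of mislabeled
    examples because individual wrong labels incur bounded loss.
    probs: (n, k) array of predicted class probabilities.
    labels: (n,) array of integer class indices.
    """
    p_y = probs[np.arange(len(labels)), labels]  # probability of the labeled class
    return np.mean((1.0 - p_y ** q) / q)

def smooth_labels(labels, num_classes, eps=0.1):
    """Label smoothing: mix one-hot targets with the uniform distribution,
    so the model is never pushed toward full confidence in a (possibly
    incorrect) hard label."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes
```

In practice these would replace the standard cross-entropy target and loss in the training loop; the bounded per-sample loss limits how strongly a single mislabeled instance can pull the model's parameters.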