Data-related (Manipulation of data by non-domain experts)
Anyone who manipulates data (e.g., training data) brings a set of assumptions about how the data should appear and be used. Common manipulations in the context of AI models include defining the ground truth label and merging different data formats or sources. When people with little or no expertise in the data's domain perform such manipulations, they may render the data unusable or harmful to the development of the AI system [173].
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1096
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Mandatory Domain Expert Integration and Oversight: Establish a formal governance framework requiring the continuous engagement of Subject Matter Experts (SMEs) to define ground truth labels, validate data provenance, and guide all data transformation and merging operations. This ensures that data manipulation processes maintain semantic integrity and are aligned with domain-specific principles, directly mitigating the risk introduced by non-expert assumptions.
2. Implement Automated and Auditable Data Quality Gateways: Deploy rigorous data quality controls, including automated validation pipelines, data profiling, and statistical anomaly detection, to operate as mandatory gateways before data is consumed for model training. The system must generate immutable audit logs for all data modification and validation activities, providing traceability and ensuring that corrupted or unusable datasets are flagged and prevented from entering the AI development environment.
3. Institute Granular Role-Based Access Controls and Specialized Training: Enforce a principle of least privilege through Role-Based Access Controls (RBAC) that limits data manipulation permissions solely to qualified data stewards and domain experts. This must be complemented by mandatory, specialized training programs focused on responsible data management, domain-specific data protocols, and the potential for unintentional bias introduced through mislabeling or flawed merging techniques.
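The second mitigation strategy (automated, auditable data quality gateways) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the record layout (a `label` and a numeric `value` per row), the `ALLOWED_LABELS` set, the z-score threshold, and all function names are hypothetical. The hash-chained log is only tamper-evident, which is weaker than the immutable storage the strategy calls for.

```python
import hashlib
import json
import statistics
from dataclasses import dataclass, field

# Hypothetical label vocabulary; in practice domain experts would define it.
ALLOWED_LABELS = {"benign", "malignant"}


@dataclass
class AuditLog:
    """Append-only log in which each entry stores the hash of the previous one."""
    entries: list = field(default_factory=list)

    def append(self, event: str, detail: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else ""
        digest = _entry_hash(event, detail, prev)
        self.entries.append(
            {"event": event, "detail": detail, "prev": prev, "hash": digest}
        )


def _entry_hash(event: str, detail: dict, prev: str) -> str:
    payload = json.dumps(
        {"event": event, "detail": detail, "prev": prev}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def chain_is_intact(log: AuditLog) -> bool:
    """Recompute every hash; any edited entry breaks the chain."""
    prev = ""
    for entry in log.entries:
        expected = _entry_hash(entry["event"], entry["detail"], prev)
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


def validate_records(records, log, z_threshold=3.0):
    """Gate a dataset: return (passed, issues) and record the outcome in the log."""
    issues = []
    for i, rec in enumerate(records):
        if rec.get("label") not in ALLOWED_LABELS:
            issues.append(f"row {i}: unknown label {rec.get('label')!r}")
        if not isinstance(rec.get("value"), (int, float)):
            issues.append(f"row {i}: missing or non-numeric value")

    # Simple statistical anomaly check: flag values far from the mean.
    numeric = [
        (i, r["value"])
        for i, r in enumerate(records)
        if isinstance(r.get("value"), (int, float))
    ]
    if len(numeric) >= 3:
        values = [v for _, v in numeric]
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        if stdev > 0:
            for i, v in numeric:
                if abs(v - mean) / stdev > z_threshold:
                    issues.append(f"row {i}: outlier value {v}")

    passed = not issues
    log.append(
        "validation", {"rows": len(records), "passed": passed, "issues": issues}
    )
    return passed, issues
```

A training job would call `validate_records` first and refuse to proceed when `passed` is false. A mean-and-standard-deviation check is only one of many possible anomaly detectors and can miss outliers in small samples; it stands in here for whatever profiling the domain experts consider appropriate.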