Data-related (Insufficient quality control in data collection process)
A lack of standardized methods and sufficient infrastructure, including the absence of quality control processes for collecting data, especially for high-stakes domains and benchmarks, can degrade the quality and alter the type of the data collected [173, 95]. The resulting risks include dataset poisoning, inadvertent copyright violation, and test set leakage, which invalidates performance metrics.
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1097
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Establish a Formal Data Governance and Standardization Framework** Implement a mandatory Data Governance Framework that defines and enforces uniform standards for data collection, formatting, and quality dimensions (e.g., accuracy, completeness, consistency) across all data sources. The framework must clearly delineate roles (e.g., Data Stewards), responsibilities, and escalation paths, and mandate data literacy training for all personnel so that collection protocols are applied consistently. This addresses the foundational lack of standardized methods.
2. **Automate Continuous Data Quality Monitoring and Validation** Deploy automated data quality tools that perform continuous, real-time profiling, validation, and anomaly detection during the data collection and ingestion phases. These tools should combine rule-based and machine-learning-based techniques to flag inconsistencies, identify statistical outliers, and detect patterns indicative of dataset poisoning attempts, so that suspect data can be quarantined and remediated before it enters the training environment.
3. **Enforce Strict Data Lineage, Access Control, and Split Integrity** Institute rigorous data provenance tracking (lineage) to maintain an immutable record of the origin, transformations, and access history of all training data. This record supports auditability for copyright compliance and forensic analysis in data poisoning incidents. Concurrently, apply the principle of least privilege (PoLP) for access control, and use sound data splitting methods (e.g., time-based or grouped splits performed *before* preprocessing) to reduce the risk of test set leakage.
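The split-integrity point in the third strategy can be illustrated with a minimal sketch. It assumes records are plain dictionaries that carry a group identifier. The field names `patient_id` and `value`, and all function names, are hypothetical and chosen only for illustration. The sketch assigns whole groups to train or test by hashing the group identifier, fits a scaling step on the training split only, and then checks that no group appears on both sides.

```python
import hashlib
import statistics


def grouped_split(records, group_key, test_fraction=0.2):
    """Split records so that every group lands entirely in train or in test.

    The assignment is a deterministic hash of the group id, so it stays
    stable when new records for an existing group arrive later.
    """
    train, test = [], []
    for rec in records:
        digest = hashlib.sha256(str(rec[group_key]).encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(rec)
    return train, test


def fit_scaler(train, field):
    """Compute scaling statistics from the training split only."""
    values = [r[field] for r in train]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero
    return mean, std


def apply_scaler(rows, field, mean, std):
    """Apply previously fitted statistics without refitting."""
    return [{**r, field: (r[field] - mean) / std} for r in rows]


def leaked_groups(train, test, group_key):
    """Return group ids that appear in both splits (should be empty)."""
    return {r[group_key] for r in train} & {r[group_key] for r in test}


# Hypothetical data: 500 measurements from 50 patients.
records = [{"patient_id": f"p{i % 50}", "value": float(i)} for i in range(500)]

# 1. Split first, by group.
train, test = grouped_split(records, "patient_id")

# 2. Fit preprocessing on train only, then apply it to both splits.
mean, std = fit_scaler(train, "value")
train_scaled = apply_scaler(train, "value", mean, std)
test_scaled = apply_scaler(test, "value", mean, std)

# 3. Verify split integrity before any training run.
overlap = leaked_groups(train, test, "patient_id")
```

Hashing the group identifier is one possible design choice, not the only one. A time-based cutoff, or a library splitter such as scikit-learn's `GroupShuffleSplit`, serves the same purpose. In every variant the ordering is what matters: split first, then fit any preprocessing on the training portion alone.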