Confidential information in data
Confidential information might be included as part of the data that is used to train or tune the model.
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1280
Domain lineage
2. Privacy & Security
2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Mitigation strategy
1. Prioritize data minimization and privacy-preserving techniques for the training dataset. This includes implementing robust pseudonymization, data masking, or synthetic data generation to irreversibly remove or obfuscate personally identifiable information (PII) and confidential company data before model ingestion, thereby significantly reducing the risk of accidental exposure or extraction.
2. Establish stringent access controls, such as role-based access control (RBAC), and network isolation for all training environments and data storage. Access to the raw training data and the computing infrastructure must be limited to authorized personnel under the principle of least privilege, and the training environment must be logically or physically segmented from corporate networks to prevent unauthorized data exfiltration.
3. Mandate end-to-end encryption for all sensitive training data. This requires applying strong cryptographic standards (e.g., AES-256) to data at rest (in storage) and in transit (during movement to and from the training cluster), so that the information remains unreadable and unusable by any unauthorized party in the event of a system compromise or data breach.
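As an illustration of the first mitigation, the following is a minimal Python sketch of pseudonymization and masking applied to a text record before it enters a training set. It is not a complete PII scrubber: the patterns, the function names (`pseudonymize`, `scrub_record`), and the placeholder formats are all assumptions made for this example, and real pipelines usually need much broader detection (names, addresses, identifiers, confidential business terms). Note also that keyed pseudonyms stay linkable for anyone who holds the key, so the key itself must be protected or destroyed.

```python
import hashlib
import hmac
import re

# Hypothetical, deliberately narrow patterns: email addresses and phone-like
# digit runs. Both are matched in a single pass so that replacement text is
# never rescanned by the other pattern.
PII_RE = re.compile(
    r"(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<phone>\+?\d[\d\s().-]{7,}\d)"
)


def pseudonymize(value: str, key: bytes) -> str:
    """Map a value to a stable token using a keyed hash (HMAC-SHA256).

    The same value and key always give the same token, so records remain
    joinable without exposing the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<ID_{digest[:12]}>"


def scrub_record(text: str, key: bytes) -> str:
    """Pseudonymize emails and mask phone-like numbers in one record."""

    def replace(match: re.Match) -> str:
        if match.group("email") is not None:
            return pseudonymize(match.group("email"), key)
        return "<PHONE>"  # masking: the original digits are discarded

    return PII_RE.sub(replace, text)


if __name__ == "__main__":
    # Assumption: in practice the key would come from a secrets manager,
    # not from source code.
    demo_key = b"example-only-key"
    print(scrub_record("Contact alice@example.com or +1 (555) 123-4567.", demo_key))
```

Pseudonymization is used for emails here so that repeated mentions of the same address map to the same token, while phone numbers are masked outright; which fields get which treatment is a design choice that depends on whether linkage is needed downstream.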