2. Privacy & Security | 1 - Pre-deployment

Confidential information in data

Confidential information might be included as part of the data that is used to train or tune the model.

Source: MIT AI Risk Repository (mit1280)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit1280

Domain lineage

2. Privacy & Security

186 mapped risks

2.1 > Compromise of privacy by leaking or correctly inferring sensitive information

Mitigation strategy

1. Prioritize data minimization and privacy-preserving techniques for the training dataset. Implement robust pseudonymization, data masking, or synthetic data generation to irreversibly remove or obfuscate personally identifiable information (PII) and confidential company data before model ingestion, significantly reducing the risk of accidental exposure or extraction.

2. Establish stringent access controls (e.g., role-based access control, RBAC) and network isolation for all training environments and data storage. Restrict access to raw training data and computing infrastructure to authorized personnel under the principle of least privilege, and segment the training environment logically or physically from corporate networks to prevent unauthorized data exfiltration.

3. Mandate end-to-end encryption for all sensitive training data. Apply strong cryptographic standards (e.g., AES-256) to data at rest (in storage) and in transit (moving to and from the training cluster) so the information remains unreadable and unusable by unauthorized parties in the event of a system compromise or data breach.
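One way the pseudonymization step above can be realized is keyed hashing: each PII value is replaced by an HMAC-SHA256 token that is deterministic (so records can still be joined) but irreversible without the secret key. The sketch below is illustrative, not a prescribed implementation; the record fields, the `PII_FIELDS` set, and the key-handling comment are assumptions for the example, and in practice the key would come from a managed secret store rather than source code.

```python
import hmac
import hashlib

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a PII value with a keyed, irreversible token.

    HMAC-SHA256 with a secret key gives a deterministic pseudonym:
    the same input always maps to the same token, but the mapping
    cannot be reversed without the key. The key must be stored and
    managed outside the training environment.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical record containing PII, before model ingestion.
record = {"name": "Jane Doe", "email": "jane@example.com", "tenure_years": 4}

KEY = b"replace-with-a-managed-secret"  # illustrative; fetch from a secret store
PII_FIELDS = {"name", "email"}          # assumed set of confidential fields

# Mask only the PII fields; leave non-sensitive features intact.
masked = {
    k: pseudonymize(v, KEY) if k in PII_FIELDS else v
    for k, v in record.items()
}
```

Because the tokens are deterministic per key, rotating the key produces a fresh, unlinkable set of pseudonyms, which is useful when a training dataset must be re-issued.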