Compromising privacy by leaking private information
Privacy violations may occur when a model reveals true information about individuals’ personal characteristics. This may stem from the model “remembering” private information present in its training data (Carlini et al., 2021).
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit237
Domain lineage
2. Privacy & Security
2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Mitigation strategy
1. **Differential Privacy (DP) Implementation** — Implement Differential Privacy, specifically techniques such as DP-SGD with per-example gradient clipping and noise injection during the (pre-)training and fine-tuning phases. This mathematically grounded approach provides provable guarantees that bound the influence of any individual data point, thereby limiting memorization and the verbatim or semantic regurgitation of training data.
2. **Source Data Minimization and Anonymization** — Enforce stringent data minimization and pre-processing protocols, including thorough anonymization, pseudonymization, or deterministic tokenization of all Personally Identifiable Information (PII) and highly sensitive data in the training and fine-tuning datasets. This foundational, preventative measure ensures the model is not exposed to raw, linkable private information, reducing the surface area for a privacy violation.
3. **Real-time Output Monitoring and Redaction** — Deploy continuous, real-time output monitoring and filtering that automatically scans generated text for patterns matching sensitive or proprietary data (e.g., PII, PHI, credentials). This runtime safeguard serves as a final defense layer, detecting and redacting unauthorized disclosures before the LLM's response reaches the user, mitigating inadvertent data leakage during inference.
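The core of DP-SGD (strategy 1) can be sketched as a single training step: clip each example's gradient to bound its individual influence, then add calibrated Gaussian noise before averaging. This is a minimal NumPy illustration, not a production implementation (libraries such as Opacus or TensorFlow Privacy handle privacy accounting); all parameter names here are illustrative.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.1, lr=0.1):
    """One differentially private SGD step (illustrative sketch)."""
    clipped = []
    for g in per_example_grads:
        # Clip each example's gradient so no single record dominates
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    # Sum clipped gradients and add Gaussian noise scaled to the clip bound
    total = np.sum(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm,
                             size=total.shape)
    noisy_mean = (total + noise) / len(per_example_grads)
    return params - lr * noisy_mean
```

The noise standard deviation is proportional to `clip_norm`, which is what ties the privacy guarantee to the bounded per-example sensitivity.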
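The output-redaction layer (strategy 3) can likewise be sketched as a pattern-based filter applied to model output before it is returned. The patterns below are hypothetical, simplified examples; a production system would use a vetted PII-detection library and far more robust patterns.

```python
import re

# Illustrative patterns only; real deployments need vetted PII detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace any matched sensitive pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text
```

For example, `redact("Contact alice@example.com or 555-123-4567")` yields `"Contact [EMAIL REDACTED] or [PHONE REDACTED]"`.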
ADDITIONAL EVIDENCE
Current large-scale LMs rely on training datasets that contain information about people. Privacy violations may occur when training data includes personal information that is then directly disclosed by the model (Carlini et al., 2021). Such information may enter the training data through no fault of the affected individual, e.g., where data leaks occur or where others post private information about them on online networks (Mao et al., 2011).