Memorization in LLMs
Memorization in LLMs refers to a model's capability to reproduce its training data verbatim when given a contextual prefix. Following [88]–[90], suppose a PII entity x is memorized by a model F; a prompt p can then induce F to emit x, where both p and x appear in the training data. For instance, if the string “Have a good day!\n alice@email.com” is present in the training data, the LLM can accurately predict Alice’s email address when given the prompt “Have a good day!\n”.
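The prefix-to-entity mechanism can be illustrated with a deliberately extreme toy model: a counting-based n-gram "model" that stores exact next-token statistics from its training corpus will, under greedy decoding, reproduce a memorized suffix whenever it sees the training prefix. This is a sketch for intuition only, not a real LLM; the corpus, tokenization, and order are illustrative assumptions.

```python
# Toy illustration of verbatim memorization: a model that perfectly
# fits next-token counts regurgitates training data given its prefix.
from collections import defaultdict

def train(corpus, order=3):
    """Count next-token frequencies for every `order`-token prefix."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in corpus:
        tokens = doc.split()
        for i in range(len(tokens) - order):
            prefix = tuple(tokens[i:i + order])
            counts[prefix][tokens[i + order]] += 1
    return counts

def generate(counts, prompt, max_tokens=5, order=3):
    """Greedy decoding: always emit the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        prefix = tuple(tokens[-order:])
        if prefix not in counts:
            break
        tokens.append(max(counts[prefix], key=counts[prefix].get))
    return " ".join(tokens)

# The PII string appears once in the training data (cf. the example above).
corpus = ["Have a good day! alice@email.com"]
model = train(corpus, order=3)
print(generate(model, "a good day!", order=3))
# -> "a good day! alice@email.com"  (the memorized email is recovered)
```

Real extraction attacks on LLMs work analogously, but against probabilistic models where memorization is partial and must be surfaced with sampling or ranking strategies.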
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit33
Domain lineage
2. Privacy & Security
2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Mitigation strategy
1. Implement Data Sanitization and Deduplication
Systematically perform preprocessing steps to remove or redact Personally Identifiable Information (PII) and highly redundant sequences from the training corpus. Data deduplication significantly reduces the likelihood of verbatim memorization without substantial utility loss, serving as the foundational defense.

2. Employ Targeted Machine Unlearning Techniques
Post-training, use unlearning-based methods (e.g., BalancedSubnet or circuit patching like Patch) to surgically edit or remove the specific model components (neurons or weights) responsible for encoding memorized content. This efficiently remediates existing memorization artifacts while preserving the model's performance on generalized tasks.

3. Apply Differential Privacy Mechanisms
For applications handling extremely sensitive data, integrate Differential Privacy (DP-SGD) into the training process. This technique provides formal, mathematically rigorous guarantees against the extraction of individual training points, offering the strongest privacy protection despite potential trade-offs in computational cost and overall model utility.
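The deduplication step in strategy 1 can be sketched as exact-match document deduplication via hashing. This is a minimal illustration under assumed normalization (strip + lowercase); production pipelines typically also catch near-duplicates and repeated substrings with suffix arrays or MinHash, which this sketch does not attempt.

```python
# Minimal sketch: drop verbatim-repeated documents before training.
# Repeated sequences are disproportionately likely to be memorized,
# so removing them is a cheap, foundational defense.
import hashlib

def deduplicate(docs):
    """Keep only the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = [
    "Have a good day! alice@email.com",
    "Have a good day! alice@email.com",  # verbatim repeat: amplifies memorization
    "An unrelated document.",
]
print(len(deduplicate(corpus)))  # -> 2
```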
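For strategy 3, the two operations that distinguish DP-SGD from ordinary SGD are per-example gradient clipping (to bound each individual's influence, i.e., sensitivity) and Gaussian noise addition to the summed gradient. The sketch below shows only these two operations in NumPy; the clip norm C and noise multiplier sigma are illustrative assumptions, and it omits the privacy accounting that a real implementation (e.g., Opacus or TensorFlow Privacy) performs.

```python
# Sketch of one DP-SGD step: clip each per-example gradient to L2 norm
# at most C, sum, add Gaussian noise scaled to sigma * C, then average.
import numpy as np

def dp_sgd_step(per_example_grads, C=1.0, sigma=1.1, rng=None):
    """Return a noisy average gradient with per-example sensitivity <= C."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down (never up) so each example's gradient has norm <= C.
        clipped.append(g * min(1.0, C / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noisy = total + rng.normal(0.0, sigma * C, size=total.shape)
    return noisy / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.0])]  # toy gradients
print(dp_sgd_step(grads).shape)  # -> (2,)
```

The clipping bound is what makes the Gaussian noise yield a formal (epsilon, delta) guarantee: without it, a single outlier example could dominate the update and be reconstructed from it.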