2. Privacy & Security

Compromising privacy or security by correctly inferring sensitive information

Anticipated risk: Privacy violations may occur at inference time even when an individual's data is absent from the training corpus. Insofar as language models (LMs) can be used to improve the accuracy of inferences about protected traits such as the sexual orientation, gender, or religion of the person providing the input prompt, they may facilitate the creation of detailed profiles of individuals, comprising true and sensitive information, without the individual's knowledge or consent.

Source: MIT AI Risk Repository (mit212)

ENTITY: 2 - AI

INTENT: 2 - Unintentional

TIMING: 2 - Post-deployment

Risk ID: mit212

Domain lineage: 2. Privacy & Security (186 mapped risks) > 2.1 Compromise of privacy by leaking or correctly inferring sensitive information

Mitigation strategy

1. Implement inference-time constrained decoding with logit masking, using regex or other pattern detection over a rolling window of generated text to block token-level generation of sensitive or personally identifiable information (PII). Blocking output patterns associated with sensitive data provides a provable prevention guarantee at the pattern level.

2. Enforce a privacy-preserving training regime such as differential privacy (DP) to limit the model's capacity to memorize and, crucially, infer individual attributes from input text; a smaller privacy loss parameter ε gives stronger guarantees against attribute-inference attacks.

3. Deploy a multi-layered post-deployment defense combining input sanitization and output filtering: apply token-level redaction to user input to remove contextual clues that could enable inference, and apply entropy-based or pattern-matching filters to the model's final response to suppress inadvertent disclosure of sensitive, low-entropy sequences.
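The pattern-matching output filter described above can be sketched in a few lines. This is a minimal illustration, not a production defense: the two regexes and the `redact_stream` function are hypothetical stand-ins for a vetted PII detector, and for simplicity the sketch rescans the full buffer after each token rather than a fixed rolling window.

```python
import re

# Illustrative PII patterns only; a real deployment would use a broader,
# audited pattern set (emails, phone numbers, national ID formats, etc.).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # SSN-shaped number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email address
]

def redact_stream(tokens):
    """Accumulate generated tokens and redact any span that completes
    a sensitive pattern as soon as its final token arrives."""
    text = ""
    for tok in tokens:
        text += tok
        for pat in PII_PATTERNS:
            text = pat.sub("[REDACTED]", text)
    return text

# The email is suppressed the moment ".com" completes the pattern.
print(redact_stream(["Contact ", "alice", "@", "example", ".com", " today."]))
# → Contact [REDACTED] today.
```

A true logit-masking implementation would go one step further and zero out the probabilities of tokens that would complete a sensitive pattern, so the redacted span is never generated at all rather than filtered after the fact.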

ADDITIONAL EVIDENCE

Example: Language utterances (e.g. Tweets) are already being analysed to predict private information such as political orientation [121, 144], age [131, 135], and health data such as addiction relapses [63].