
Extraction attack

An attribute inference attack is used to determine whether certain sensitive features can be inferred about individuals who participated in training a model. These attacks occur when an adversary has some prior knowledge about the training data and uses that knowledge to infer the sensitive attributes.

Source: MIT AI Risk Repository (mit1287)

ENTITY: 1 - Human

INTENT: 1 - Intentional

TIMING: 2 - Post-deployment

Risk ID: mit1287

Domain lineage: 2. Privacy & Security > 2.2 AI system security vulnerabilities and attacks

Mitigation strategy

1. **Data minimization and differential privacy.** Institute a mandatory data minimization policy, ensuring the model is trained only on strictly necessary, non-sensitive features. Before training, apply formal privacy-preserving techniques, specifically **Differential Privacy** (DP), to the training data or the learning process. DP adds calibrated noise to obscure individual record contributions, mathematically guaranteeing a quantifiable level of privacy protection against attribute inference attacks while striving to maintain acceptable model utility.

2. **Post-training attribute unlearning and embedding purification.** Implement post-training defenses that reduce information leakage from the final model artifact. Use **Attribute Unlearning (AU)** frameworks (e.g., AttrCloak) to fine-tune the model (e.g., user embeddings in recommender systems) so as to minimize the mutual information between internal representations and the sensitive attributes, without requiring full model retraining or significant architectural modifications.

3. **Adversarial defenses and output obfuscation.** Deploy active defenses that manipulate the model's inputs or outputs to frustrate the adversary. Integrate **score masking strategies** or **adversarial machine learning defenses** (e.g., AttriGuard) that strategically perturb the model's confidence scores or the input data so that the resulting attribute inference predictions become unreliable (e.g., accuracy reduced to a random baseline). Additionally, enforce strict **query rate limits** at the API level to curb high-volume black-box probing attacks.
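The DP guarantee in strategy 1 can be illustrated with the classic Laplace mechanism applied to an aggregate query. A minimal sketch, assuming a bounded numeric column; the `epsilon` value, bounds, and data are illustrative, not from the source:

```python
import random
import math

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_mean(values: list[float], lower: float, upper: float, epsilon: float) -> float:
    """Release the mean of a bounded column with epsilon-differential privacy.

    Clipping bounds each record's influence on the mean to (upper - lower) / n;
    Laplace noise calibrated to that sensitivity limits what an adversary can
    infer about any single participant's value from the released statistic.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon)

ages = [34, 45, 29, 52, 41, 38, 47, 33]  # illustrative sensitive column
noisy_mean = dp_mean(ages, lower=18, upper=90, epsilon=1.0)
```

Smaller `epsilon` means more noise and stronger privacy; in practice the same calibration idea extends to gradients (DP-SGD) when protecting the learning process itself.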
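AU frameworks such as AttrCloak fine-tune a trained model; as a much simpler stand-in for the same intuition in strategy 2, a linear "embedding purification" step can project the sensitive-attribute direction out of user embeddings. This sketch is illustrative only (synthetic data, difference-of-means direction), not the cited frameworks' method:

```python
import numpy as np

def purify_embeddings(emb: np.ndarray, sensitive: np.ndarray) -> np.ndarray:
    """Remove the linear direction separating the two sensitive groups.

    emb: (n_users, dim) embedding matrix; sensitive: (n_users,) binary labels.
    After the projection, the group means coincide along this direction, so a
    linear probe on it can no longer recover the attribute (at some cost to
    other information carried by that direction).
    """
    direction = emb[sensitive == 1].mean(axis=0) - emb[sensitive == 0].mean(axis=0)
    direction /= np.linalg.norm(direction)
    # Subtract each embedding's component along the sensitive direction.
    return emb - np.outer(emb @ direction, direction)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
emb = rng.normal(size=(200, 16))
emb[:, 0] += 3.0 * labels          # deliberately leak the attribute into dim 0
clean = purify_embeddings(emb, labels)
```

Real AU methods minimize mutual information rather than a single linear direction, but the goal is the same: strip attribute-predictive signal from the representations without retraining from scratch.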
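The output-obfuscation and rate-limit measures in strategy 3 are straightforward to prototype. A minimal sketch; the rounding precision and token-bucket parameters are illustrative choices, not prescribed values:

```python
import time

def mask_scores(scores: list[float], precision: int = 1) -> list[float]:
    """Coarsen confidence scores before returning them to the caller.

    Rounding (or returning only the top label) removes the fine-grained
    score differences that black-box inference attacks exploit.
    """
    return [round(s, precision) for s in scores]

class TokenBucket:
    """Per-client API rate limiter: refills `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

masked = mask_scores([0.8731, 0.1269])   # coarse scores leak less
bucket = TokenBucket(rate=5.0, capacity=10)
```

Together these raise the cost of high-volume probing: each query yields less information, and the adversary gets far fewer queries per unit time.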