2. Privacy & Security | 2 - Post-deployment

Inference Attacks

Inference attacks [150] include membership inference attacks, property inference attacks, and data reconstruction attacks. These attacks allow an adversary to infer the composition or properties of the training data. Previous work [67] has demonstrated that inference attacks succeed readily against earlier PLMs, implying that LLMs may also be vulnerable to such attacks.
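To make the membership inference threat concrete, the sketch below shows the simplest black-box variant: an attacker exploits the fact that overfit models tend to be more confident on training members than on unseen inputs. The `predict_proba` interface, the threshold value, and the toy model are illustrative assumptions, not details from the source.

```python
def membership_guess(predict_proba, sample, threshold=0.95):
    """Guess whether `sample` was a training member.

    Overfit models are often unusually confident on memorized
    training points, so a high top-class confidence is the signal.
    """
    confidences = predict_proba(sample)
    return max(confidences) >= threshold

# Toy stand-in for a black-box model that has memorized one input.
def toy_predict_proba(sample):
    return [0.99, 0.01] if sample == "memorized" else [0.6, 0.4]

print(membership_guess(toy_predict_proba, "memorized"))  # True
print(membership_guess(toy_predict_proba, "unseen"))     # False
```

This is also why the mitigation below recommends limiting the precision of exposed confidence scores: the attack signal lives entirely in those scores.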

Source: MIT AI Risk Repository (mit46)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit46

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement Ensemble and Adaptive Inference Architectures

Employ a novel ensemble training methodology, such as Split-AI, which partitions the training data into random subsets and utilizes an adaptive inference strategy to aggregate outputs only from sub-models that were not trained on the queried sample. This approach empirically ensures similar model behavior on members and non-members, offering a superior trade-off between membership privacy and model utility compared to provable privacy techniques.

2. Apply Diffusion-Driven Data Preprocessing (D3P)

Utilize a generative defense framework, such as D3P, which leverages the synthesis capabilities of diffusion models to transform sensitive training data before model learning. This process alters the fine-grained statistical characteristics of the input data, effectively reducing the exploitable membership signals without requiring modification to the learning algorithm or network architecture.

3. Enforce Mitigations Through Training Controls and Query Monitoring

Mitigate the root cause of leakage (overfitting) by applying regularization techniques to ensure the model is more generalizable and less prone to memorizing specific data records. Furthermore, reduce the attack surface for black-box adversaries by limiting the precision of prediction confidence scores and deploying a monitoring stack capable of detecting behavioral anomalies, such as bursts of near-identical queries or suspiciously long outputs that signal data reconstruction attempts.
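The first mitigation above can be sketched in a few lines. This is a minimal, assumed implementation of the Split-AI idea (the function names and the toy "overfit" model are illustrative, not the published method): training data is partitioned into random disjoint subsets, one sub-model is trained per subset, and at inference only sub-models that never saw the queried sample vote, so members and non-members receive statistically similar outputs.

```python
import random

def split_data(dataset, k, seed=0):
    """Randomly partition `dataset` into k disjoint subsets."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def adaptive_inference(query, subsets, submodels):
    """Average outputs of sub-models NOT trained on `query`."""
    outputs = [
        model(query)
        for subset, model in zip(subsets, submodels)
        if query not in subset
    ]
    if not outputs:  # degenerate case: every sub-model saw the sample
        outputs = [model(query) for model in submodels]
    return sum(outputs) / len(outputs)

def make_toy_model(subset):
    # Toy "overfit" sub-model: high score on its own training points,
    # a baseline score everywhere else.
    return lambda q: 1.0 if q in subset else 0.5

dataset = list(range(10))
subsets = split_data(dataset, k=3)
models = [make_toy_model(s) for s in subsets]

# A training member (3) and a non-member (99) now get the same score,
# removing the confidence gap that membership inference exploits.
print(adaptive_inference(3, subsets, models))   # 0.5
print(adaptive_inference(99, subsets, models))  # 0.5
```

Note the design choice: privacy comes from routing queries away from the sub-models that memorized them, rather than from adding noise, which is why the source describes the utility trade-off as favorable compared with provable (e.g. differentially private) techniques.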