
Extraction Attacks

Extraction attacks [137] allow an adversary to query a black-box victim model and build a substitute model by training on the query–response pairs. The substitute model can achieve nearly the same performance as the victim. While fully replicating the capabilities of LLMs is difficult, adversaries could develop a domain-specific model that distills domain knowledge from an LLM.
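The attack loop described above can be sketched in a few lines. This is a minimal illustration, not the method from the cited work: scikit-learn models stand in for a deployed service, `query_victim` is a hypothetical black-box API returning only labels, and the adversary's random-sampling query strategy is a simplifying assumption.

```python
# Sketch of a black-box model extraction attack (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Victim: a model the adversary can query but not inspect.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
victim = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def query_victim(inputs):
    """Hypothetical black-box API: returns labels only, no weights/gradients."""
    return victim.predict(inputs)

# Adversary: sample synthetic queries, record responses, train a substitute.
queries = rng.normal(size=(5000, 10))
responses = query_victim(queries)
substitute = DecisionTreeClassifier(random_state=0).fit(queries, responses)

# Agreement rate: how often the substitute mimics the victim on fresh inputs.
probe = rng.normal(size=(1000, 10))
agreement = (substitute.predict(probe) == query_victim(probe)).mean()
print(f"substitute/victim agreement: {agreement:.2f}")
```

In practice an adversary would use query strategies far more sample-efficient than random sampling, and for LLMs the "responses" would be generated text used as fine-tuning data rather than class labels.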

Source: MIT AI Risk Repository (mit45)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit45

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement **Differential Privacy (DP)** during model training, coupled with rigorous **data deduplication and sanitization**, to reduce the risk of sensitive information being memorized and subsequently extracted.
2. Enforce **strict, adaptive query rate limits (session-based)** and deploy **real-time behavioral monitoring** to detect and mitigate the high-volume or template-based querying patterns characteristic of model extraction and prompt stealing attacks.
3. Deploy **architectural defense mechanisms**, such as **watermarking** (e.g., within attention layers) or **output disruption frameworks** (e.g., adversarial fine-tuning), designed to degrade the functional integrity or knowledge transfer of any resultant substitute model.
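Strategy 2 can be sketched as a per-session defense. The sketch below combines a sliding-window rate limit with a crude repeated-prefix check as a stand-in for behavioral monitoring; the `ExtractionGuard` class, its thresholds, and the prefix heuristic are all hypothetical, chosen only to illustrate the idea.

```python
# Illustrative per-session query guard (hypothetical design, not a
# production defense): volume limiting plus template-burst detection.
import time
from collections import deque, defaultdict

class ExtractionGuard:
    def __init__(self, max_queries=100, window_s=60.0, max_repeat_prefix=20):
        self.max_queries = max_queries          # queries allowed per window
        self.window_s = window_s                # sliding-window length (seconds)
        self.max_repeat_prefix = max_repeat_prefix  # template-burst threshold
        self.history = defaultdict(deque)       # session_id -> timestamps
        self.prefixes = defaultdict(lambda: defaultdict(int))

    def allow(self, session_id, query, now=None):
        now = time.monotonic() if now is None else now
        ts = self.history[session_id]
        # Drop timestamps that fell out of the sliding window.
        while ts and now - ts[0] > self.window_s:
            ts.popleft()
        if len(ts) >= self.max_queries:
            return False  # volume limit hit
        # Crude behavioral signal: many queries sharing a prefix suggest
        # template-based extraction probing.
        prefix = query[:32]
        self.prefixes[session_id][prefix] += 1
        if self.prefixes[session_id][prefix] > self.max_repeat_prefix:
            return False  # template-style repeated queries
        ts.append(now)
        return True
```

A real deployment would adapt the thresholds per tenant, persist state across workers, and feed rejected patterns into downstream monitoring rather than simply refusing the request.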