Extraction Attacks
Extraction attacks [137] allow an adversary to query a black-box victim model and train a substitute model on the collected query-response pairs. The substitute model can approach the victim model's performance on the targeted task. While fully replicating the capabilities of an LLM is difficult, an adversary can still distill a domain-specific model that draws its domain knowledge from the LLM.
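The query-and-distill loop above can be sketched minimally. This is an illustrative toy, not a real attack: the "victim" is a hypothetical linear scorer behind an API that returns only hard labels, and the adversary fits a substitute by least squares on the stolen labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box victim: a hidden linear scorer the adversary
# can query but never inspect directly.
W_secret = rng.normal(size=4)

def victim_api(x):
    # A deployed API typically returns only the predicted label.
    return int(x @ W_secret > 0)

# Step 1: the adversary sends many queries and records the responses.
queries = rng.normal(size=(2000, 4))
labels = np.array([victim_api(x) for x in queries])

# Step 2: train a substitute on the (query, response) pairs
# (here, a least-squares fit mapping inputs to signed labels).
W_sub, *_ = np.linalg.lstsq(queries, 2 * labels - 1, rcond=None)

# Step 3: measure how often the substitute agrees with the victim
# on fresh inputs the adversary never queried.
test = rng.normal(size=(500, 4))
agree = np.mean((test @ W_sub > 0) == (test @ W_secret > 0))
```

With a few thousand queries the substitute's decision boundary closely tracks the victim's, which is the core risk: functionality leaks through the query interface alone.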
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit45
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement **Differential Privacy (DP)** during model training, coupled with rigorous **data deduplication and sanitization**, to fundamentally reduce the fidelity and risk of sensitive information being verifiably memorized and subsequently extracted.
2. Enforce **strict, adaptive query rate limitations (session-based limitations)** and deploy **real-time behavioral monitoring** to detect and mitigate high-volume or template-based adversarial querying patterns characteristic of model extraction and prompt stealing attacks.
3. Deploy **architectural defense mechanisms**, such as **watermarking** (e.g., within attention layers) or **output disruption frameworks** (e.g., Adversarial Fine-Tuning), designed to compromise the functional integrity or knowledge transfer of any resultant substitute model.
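The session-based query limitation in mitigation 2 can be sketched as a sliding-window limiter. This is a minimal illustration under assumed parameters (the `max_queries` and `window_s` values are hypothetical); a production deployment would combine it with behavioral analysis of query content, not just volume.

```python
import time
from collections import deque

class QueryRateLimiter:
    """Sliding-window, per-session limiter: rejects sessions that
    exceed max_queries within the trailing window_s seconds."""

    def __init__(self, max_queries=100, window_s=60.0):
        self.max_queries = max_queries
        self.window_s = window_s
        self.history = {}  # session_id -> deque of query timestamps

    def allow(self, session_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.history.setdefault(session_id, deque())
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.max_queries:
            # High-volume pattern: reject and flag for behavioral review.
            return False
        q.append(now)
        return True
```

For example, a limiter configured with `max_queries=3, window_s=10` admits three queries from a session, rejects the fourth inside the window, and admits again once the earlier timestamps expire.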