Training-related (Adversarial examples)
Adversarial examples [198, 83] are inputs deliberately crafted to induce unintended behavior in an AI model, typically by exploiting spurious correlations the model has learned. They are inference-time attacks: the adversarial inputs are presented to the model as test examples. Adversarial examples often transfer, generalizing across different model architectures and across models trained on different training sets.
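As a minimal sketch of how such inputs are crafted (not part of the source entry), the classic fast gradient sign method (FGSM) can be illustrated on a hand-built logistic-regression model; the weights, input, and ε below are hypothetical values chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """FGSM attack on a logistic-regression model.

    The gradient of the cross-entropy loss w.r.t. the input x is
    (sigmoid(w @ x + b) - y) * w; the attack steps eps in its sign direction.
    """
    grad = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad)

# Hypothetical toy model and clean input (true label 1).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.5, 0.2])
y = 1.0

clean_pred = sigmoid(w @ x + b)    # > 0.5: correctly classified as 1
x_adv = fgsm(x, y, w, b, eps=0.6)
adv_pred = sigmoid(w @ x_adv + b)  # < 0.5: flipped to class 0
```

A small, targeted perturbation (here ±0.6 per coordinate) is enough to flip the prediction, even though the input barely changes from a human perspective.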
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1098
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Prioritize **Adversarial Training** (also known as adversarial learning): augment the model's training dataset with carefully crafted adversarial examples. Exposing the model to manipulated inputs during training improves its intrinsic robustness and resilience against evasion attacks.
2. Implement **Differential Privacy** and **Output Obfuscation**: to mitigate inference-time attacks such as model inversion and membership inference, apply differential privacy techniques to protect individual data points. Additionally, reduce the granularity of model outputs (e.g., return only class labels instead of probabilities) to deny adversaries the information needed to reverse-engineer the model or craft precise attacks.
3. Employ **Robust Feature Extraction** and **Data Validation Pipelines**: use robust feature-extraction methods that isolate meaningful patterns in input data while minimizing the influence of irrelevant or misleading information (noise). Additionally, automate validation and sanitization checks on all inference inputs to detect and remove subtle, malicious perturbations before they can compromise prediction integrity.
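Mitigation 1 can be sketched as a training loop that augments each gradient step with freshly generated adversarial examples. The sketch below assumes a toy logistic-regression classifier in NumPy; the data, learning rate, and ε are illustrative assumptions, not values from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, b, eps):
    # Per-example input gradient of the cross-entropy loss for
    # logistic regression; step eps in its sign direction.
    grad = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad)

# Hypothetical linearly separable toy data.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
lr, eps = 0.5, 0.1
for _ in range(200):
    # Adversarial training: augment each step with FGSM versions
    # of the current batch, generated against the current model.
    X_adv = fgsm(X, y, w, b, eps)
    X_aug = np.vstack([X, X_adv])
    y_aug = np.concatenate([y, y])
    p = sigmoid(X_aug @ w + b)
    # Gradient descent on the mean cross-entropy loss.
    w -= lr * (X_aug.T @ (p - y_aug)) / len(y_aug)
    b -= lr * np.mean(p - y_aug)

clean_acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
adv_acc = np.mean((sigmoid(fgsm(X, y, w, b, eps) @ w + b) > 0.5) == (y == 1))
```

Because the model sees perturbed inputs at every step, it learns a decision boundary that holds up better under the same attack at inference time; the gap between `clean_acc` and `adv_acc` is the residual vulnerability.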