Transferable adversarial attacks from open to closed-source models
In some cases, an adversarial attack developed against an open-weights, open-source model (where the weights and architecture are known, a "white-box" attack) transfers to closed-source models, despite the defenses put in place by the closed-source model provider (such as structured access). Such adversarial attacks can be generated automatically [238].
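The transfer mechanism can be illustrated with a minimal pure-Python sketch in the style of the fast gradient sign method (FGSM): the attacker computes a perturbation using the known weights of an open surrogate model, then submits the perturbed input to a separate "closed" target. The linear classifiers and all numbers below are illustrative assumptions, not any real system.

```python
# Sketch of a transfer attack: craft an FGSM-style perturbation against an
# open-weights surrogate, then apply it to a closed-source target whose
# weights the attacker never sees. Both "models" are toy linear classifiers.

def predict(weights, x):
    # Toy linear score: positive -> class 1, negative -> class 0.
    return sum(w * xi for w, xi in zip(weights, x))

def fgsm_perturb(surrogate_weights, x, true_label, eps):
    # FGSM: step each feature by eps in the direction that increases the
    # surrogate's loss. For a linear score, the gradient w.r.t. x is just
    # the weight vector, so only its sign pattern is needed.
    sign = -1.0 if true_label == 1 else 1.0  # push score away from the true class
    return [xi + sign * eps * (1.0 if w > 0 else -1.0)
            for w, xi in zip(surrogate_weights, x)]

# The surrogate is fully known to the attacker; the target is hidden but,
# having been trained on similar data, has similar weights (the property
# that makes the perturbation transfer).
surrogate = [0.9, -0.5, 0.3]
target    = [0.8, -0.6, 0.4]

x = [1.0, 1.0, 1.0]                          # clean input, true label 1
adv = fgsm_perturb(surrogate, x, true_label=1, eps=1.0)

clean_score = predict(target, x)    # positive: target classifies correctly
adv_score   = predict(target, adv)  # negative: the attack transferred
```

The example works because the surrogate and target gradients point in similar directions; the closer the two models' decision boundaries, the more reliably such perturbations transfer.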
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1138
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Advanced Adversarial Training Regimens.** Implement ensemble adversarial training, augmenting the training dataset with a diverse collection of adversarial examples generated by multiple surrogate models and various attack techniques. This approach is demonstrated to enhance classifier robustness by explicitly mitigating the risk of overfitting to a single attack model, thereby strengthening defenses against black-box transfer-based attacks.

2. **Input Pre-processing and Stochastic Transformation.** Deploy robust, non-differentiable stochastic transformations on all incoming inputs prior to classification. Methods such as **randomized discretization** or the application of diffusion-based purification models are highly effective. These techniques aim to destroy the minimal, imperceptible adversarial perturbations by introducing controlled noise or restoring the input's original distribution, which fundamentally breaks the transferability of the attack.

3. **Architectural Gradient Obfuscation.** Incorporate non-differentiable components, such as random forests, into the terminal layers of the general-purpose AI (GPAI) model's architecture. This method intentionally masks or shatters the gradient information attackers need to calculate effective white-box perturbations, thus preventing the generation of new, highly potent transferable adversarial examples against the target closed-source model.
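The second mitigation can be sketched in a few lines: add small random noise to each input feature, then snap it to a coarse grid, so that the tiny, precisely tuned perturbations a transfer attack relies on are destroyed while the clean signal is roughly preserved. This is a minimal pure-Python illustration of the idea, not the published randomized-discretization defense; the parameter names and values are assumptions.

```python
import random

def randomized_discretization(x, levels=8, noise=0.05, seed=None):
    # Stochastic pre-processing defense sketch: perturb each feature with
    # small uniform noise, then quantize it to one of `levels` grid points
    # in [0, 1]. The quantization step is non-differentiable, and the noise
    # makes the transform different on every call.
    rng = random.Random(seed)
    step = 1.0 / (levels - 1)
    out = []
    for xi in x:
        noisy = xi + rng.uniform(-noise, noise)
        snapped = round(noisy / step) * step
        out.append(min(1.0, max(0.0, snapped)))  # clamp back into [0, 1]
    return out

# A perturbation smaller than half the grid step is usually absorbed:
cleaned = randomized_discretization([0.20, 0.51, 0.93], seed=0)
```

In practice the grid resolution trades off robustness against clean accuracy: coarser grids absorb larger perturbations but discard more of the legitimate signal.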