Adversarial attacks targeting explainable AI techniques
Adversarial attacks can affect not only a model’s output but also its corresponding explanation. Current adversarial optimization techniques can introduce imperceptible noise into an input image so that the model’s output remains unchanged while the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less widely known than standard adversarial attacks targeting the model’s output.
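The manipulation just described can be framed as an optimization over the input: pull the perturbed input's attribution toward an attacker-chosen target while penalizing any change in the prediction, keeping the perturbation small. The tiny two-layer model, the input-gradient attribution, and all numeric values below are illustrative assumptions (not the method of [61]); a minimal pure-Python sketch using finite-difference gradients:

```python
import math

# Toy differentiable model (illustrative weights): y = w2 . tanh(W1 x)
W1 = [[0.8, -0.5], [0.3, 0.9]]
w2 = [1.2, -0.7]

def predict(x):
    return sum(w2[j] * math.tanh(sum(W1[j][i] * x[i] for i in range(2)))
               for j in range(2))

def explanation(x):
    # input-gradient attribution: d predict / d x_i
    a = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    return [sum(W1[j][i] * w2[j] * (1.0 - a[j] ** 2) for j in range(2))
            for i in range(2)]

x = [0.6, -0.4]
y0 = predict(x)
g_target = list(reversed(explanation(x)))  # arbitrary attacker-chosen attribution

def attack_loss(xp, lam=10.0):
    # pull the explanation toward the target while pinning the prediction
    g = explanation(xp)
    return (sum((g[i] - g_target[i]) ** 2 for i in range(2))
            + lam * (predict(xp) - y0) ** 2)

x_adv, eps, step = list(x), 0.5, 0.02
for _ in range(300):
    # finite-difference gradient of the attack loss w.r.t. the input,
    # followed by a projected step that keeps the perturbation small
    grad = []
    for i in range(2):
        xp, xm = list(x_adv), list(x_adv)
        xp[i] += 1e-5
        xm[i] -= 1e-5
        grad.append((attack_loss(xp) - attack_loss(xm)) / 2e-5)
    x_adv = [min(max(x_adv[i] - step * grad[i], x[i] - eps), x[i] + eps)
             for i in range(2)]
```

After the loop, `x_adv` stays within a small L-infinity ball around `x`, the prediction is nearly unchanged, yet the attribution has moved toward the attacker's target — the core of the attack pattern described above.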
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1132
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Prioritize the implementation of Adversarial Training on Explanations (ATEX) or similar robust optimization methods during model development to intrinsically improve the stability and resilience of explanation outputs against perturbations that manipulate feature attribution while preserving prediction consistency.
2. Deploy real-time explanation fidelity detectors that monitor the logical consistency and faithfulness of generated explanations. Verify explanation reliability with principled techniques, such as measuring the drop in model confidence when highly attributed input features are masked, to detect manipulated or unfaithful attributions.
3. Establish a defense-in-depth strategy by restricting model output granularity and limiting API query rates to deter model extraction and reverse-engineering of the explanation mechanism. Additionally, favor explanation aggregation across diverse XAI algorithms to reduce the attack surface presented by any single interpretability technique.
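The masking-based fidelity check described above can be sketched as follows. The toy linear softmax classifier, the gradient-times-input attribution, and all numeric values are illustrative assumptions chosen so the contrast is visible; a faithful attribution should cause a large confidence drop when its top features are masked, while a manipulated one often does not:

```python
import math

# Toy linear softmax classifier (illustrative weights, not a real system)
W = [[3.0, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 3.0]]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def confidence(x):
    # top-class probability and its index
    p = softmax([sum(w[i] * x[i] for i in range(len(x))) for w in W])
    c = p.index(max(p))
    return p[c], c

def attribution(x):
    # hypothetical gradient-times-input attribution for the top class
    _, c = confidence(x)
    return [W[c][i] * x[i] for i in range(len(x))]

def faithfulness_drop(x, attr, k=2):
    # mask the k most-attributed features, return the confidence drop
    conf0, _ = confidence(x)
    top = sorted(range(len(x)), key=lambda i: -abs(attr[i]))[:k]
    x_masked = [0.0 if i in top else x[i] for i in range(len(x))]
    conf1, _ = confidence(x_masked)
    return conf0 - conf1

x = [1.0, 1.0, 0.2, 0.1]
drop_real = faithfulness_drop(x, attribution(x))
drop_fake = faithfulness_drop(x, [0.0, 0.01, 2.0, 3.0])  # manipulated attribution
```

Here masking the features flagged by the genuine attribution collapses the model's confidence, while masking the features flagged by the manipulated attribution barely changes it; a detector can flag explanations whose drop falls below a calibrated threshold.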