Adversarial attacks targeting explainable AI techniques
Adversarial attacks can affect not only a model’s output but also its corresponding explanation. Current adversarial optimization techniques can introduce imperceptible noise into an input image so that the model’s output remains unchanged while the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less widely known than standard adversarial attacks targeting the model’s output.
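The manipulation just described can be framed as an optimization over the input: pull the perturbed input's attribution toward an attacker-chosen target while penalizing any change in the prediction, keeping the perturbation small. The tiny two-layer model, the input-gradient attribution, and all numeric values below are illustrative assumptions (not the method of [61]); a minimal pure-Python sketch using finite-difference gradients:

```python
import math

# Toy differentiable model (illustrative weights): y = w2 . tanh(W1 x)
W1 = [[0.8, -0.5], [0.3, 0.9]]
w2 = [1.2, -0.7]

def predict(x):
    return sum(w2[j] * math.tanh(sum(W1[j][i] * x[i] for i in range(2)))
               for j in range(2))

def explanation(x):
    # input-gradient attribution: d predict / d x_i
    a = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    return [sum(W1[j][i] * w2[j] * (1.0 - a[j] ** 2) for j in range(2))
            for i in range(2)]

x = [0.6, -0.4]
y0 = predict(x)
g_target = list(reversed(explanation(x)))  # arbitrary attacker-chosen attribution

def attack_loss(xp, lam=10.0):
    # pull the explanation toward the target while pinning the prediction
    g = explanation(xp)
    return (sum((g[i] - g_target[i]) ** 2 for i in range(2))
            + lam * (predict(xp) - y0) ** 2)

x_adv, eps, step = list(x), 0.5, 0.02
for _ in range(300):
    # finite-difference gradient of the attack loss w.r.t. the input,
    # followed by a projected step that keeps the perturbation small
    grad = []
    for i in range(2):
        xp, xm = list(x_adv), list(x_adv)
        xp[i] += 1e-5
        xm[i] -= 1e-5
        grad.append((attack_loss(xp) - attack_loss(xm)) / 2e-5)
    x_adv = [min(max(x_adv[i] - step * grad[i], x[i] - eps), x[i] + eps)
             for i in range(2)]
```

After the loop, `x_adv` stays within a small L-infinity ball around `x`, the prediction is nearly unchanged, yet the attribution has moved toward the attacker's target — the core of the attack pattern described above.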
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1132
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Prioritize the implementation of Adversarial Training on Explanations (ATEX) or similar robust optimization methods during model development to intrinsically improve the stability and resilience of explanation outputs against perturbations that manipulate feature attribution while preserving prediction consistency.
2. Deploy real-time explanation fidelity detectors that monitor the logical consistency and faithfulness of generated explanations. Verify explanation reliability with principled techniques, such as measuring the drop in model confidence when highly attributed input features are masked, to detect manipulated or unfaithful attributions.
3. Establish a defense-in-depth strategy by restricting model output granularity and limiting API query rates to deter model extraction and reverse-engineering of the explanation mechanism. Additionally, favor explanation aggregation across diverse XAI algorithms to reduce the attack surface presented by any single interpretability technique.
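The masking-based fidelity check described above can be sketched as follows. The toy linear softmax classifier, the gradient-times-input attribution, and all numeric values are illustrative assumptions chosen so the contrast is visible; a faithful attribution should cause a large confidence drop when its top features are masked, while a manipulated one often does not:

```python
import math

# Toy linear softmax classifier (illustrative weights, not a real system)
W = [[3.0, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 3.0]]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def confidence(x):
    # top-class probability and its index
    p = softmax([sum(w[i] * x[i] for i in range(len(x))) for w in W])
    c = p.index(max(p))
    return p[c], c

def attribution(x):
    # hypothetical gradient-times-input attribution for the top class
    _, c = confidence(x)
    return [W[c][i] * x[i] for i in range(len(x))]

def faithfulness_drop(x, attr, k=2):
    # mask the k most-attributed features, return the confidence drop
    conf0, _ = confidence(x)
    top = sorted(range(len(x)), key=lambda i: -abs(attr[i]))[:k]
    x_masked = [0.0 if i in top else x[i] for i in range(len(x))]
    conf1, _ = confidence(x_masked)
    return conf0 - conf1

x = [1.0, 1.0, 0.2, 0.1]
drop_real = faithfulness_drop(x, attribution(x))
drop_fake = faithfulness_drop(x, [0.0, 0.01, 2.0, 3.0])  # manipulated attribution
```

Here masking the features flagged by the genuine attribution collapses the model's confidence, while masking the features flagged by the manipulated attribution barely changes it; a detector can flag explanations whose drop falls below a calibrated threshold.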