Training-related (Robust overfitting in adversarial training)
Adversarial training can suffer from robust overfitting, in which the model’s robustness on test data degrades as training continues, particularly after the learning-rate decay. This issue has been observed consistently across datasets and algorithms in adversarial training settings [163, 230]. Robust overfitting impairs the model’s ability to generalize effectively and reduces its resilience to adversarial attacks.
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1099
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Implement rigorous early stopping, halting adversarial training once robust test accuracy stops improving; this is a demonstrably effective way to prevent the loss of generalization associated with robust overfitting. 2. Employ model smoothening techniques such as Stochastic Weight Averaging (SWA), which promotes the discovery of flatter minima, or knowledge distillation (KD) with self-training, which smooths the network's logits, thereby injecting learned regularization to mitigate the phenomenon. 3. Apply targeted optimization and regularization strategies, including consistency regularization or a layer-wise adversarial training approach that specifically regularizes the optimization of the later layers of the deep neural network, which have been implicated in the onset of robust overfitting.
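The early-stopping step in the mitigation strategy can be sketched as a small tracker that monitors robust test accuracy and signals when to checkpoint and when to halt. This is a minimal illustrative sketch, not any cited implementation; the class name `RobustEarlyStopper` and the `patience` and `min_delta` parameters are assumptions introduced here.

```python
class RobustEarlyStopper:
    """Halt adversarial training once robust test accuracy stops improving.

    Illustrative sketch (not from a cited implementation): `patience` is the
    number of consecutive epochs without improvement tolerated before stopping,
    and `min_delta` is the minimum accuracy gain counted as an improvement.
    Callers checkpoint the model whenever a new best is reported.
    """

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, robust_test_acc):
        """Return (is_new_best, should_stop) for this epoch's robust accuracy."""
        if robust_test_acc > self.best_acc + self.min_delta:
            # New best robust accuracy: reset the patience counter.
            self.best_acc = robust_test_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Example: robust accuracy peaks mid-training, then declines — the signature
# of robust overfitting. Training stops shortly after the peak at epoch 2.
stopper = RobustEarlyStopper(patience=2)
curve = [0.30, 0.38, 0.42, 0.41, 0.40, 0.39]
for epoch, acc in enumerate(curve):
    is_best, stop = stopper.step(epoch, acc)
    if stop:
        break
```

In a real adversarial training loop, the model weights would be saved on each `is_new_best` epoch and restored after stopping, so the deployed model corresponds to the peak of robust test accuracy rather than the final, overfit epoch.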