Training-related (Robust overfitting in adversarial training)
Adversarial training can suffer from robust overfitting, in which the model’s robustness on test data degrades as training continues, particularly after the learning-rate decay. This issue has been observed consistently across datasets and algorithms in adversarial training settings [163, 230]. Robust overfitting impairs the model’s ability to generalize effectively and reduces its resilience to adversarial attacks.
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1099
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Implement rigorous early stopping, halting adversarial training once robust test accuracy stops improving; this is a demonstrably effective way to prevent the loss of generalization associated with robust overfitting. 2. Employ model smoothening techniques such as Stochastic Weight Averaging (SWA), which promotes the discovery of flatter minima, or knowledge distillation (KD) with self-training, which smooths the network's logits, thereby injecting learned regularization to mitigate the phenomenon. 3. Apply targeted optimization and regularization strategies, including consistency regularization or a layer-wise adversarial training approach that specifically regularizes the optimization of the later layers of the deep neural network, which have been implicated in the onset of robust overfitting.
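The early-stopping step in the mitigation strategy can be sketched as a small tracker that monitors robust test accuracy and signals when to checkpoint and when to halt. This is a minimal illustrative sketch, not any cited implementation; the class name `RobustEarlyStopper` and the `patience` and `min_delta` parameters are assumptions introduced here.

```python
class RobustEarlyStopper:
    """Halt adversarial training once robust test accuracy stops improving.

    Illustrative sketch (not from a cited implementation): `patience` is the
    number of consecutive epochs without improvement tolerated before stopping,
    and `min_delta` is the minimum accuracy gain counted as an improvement.
    Callers checkpoint the model whenever a new best is reported.
    """

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, robust_test_acc):
        """Return (is_new_best, should_stop) for this epoch's robust accuracy."""
        if robust_test_acc > self.best_acc + self.min_delta:
            # New best robust accuracy: reset the patience counter.
            self.best_acc = robust_test_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Example: robust accuracy peaks mid-training, then declines — the signature
# of robust overfitting. Training stops shortly after the peak at epoch 2.
stopper = RobustEarlyStopper(patience=2)
curve = [0.30, 0.38, 0.42, 0.41, 0.40, 0.39]
for epoch, acc in enumerate(curve):
    is_best, stop = stopper.step(epoch, acc)
    if stop:
        break
```

In a real adversarial training loop, the model weights would be saved on each `is_new_best` epoch and restored after stopping, so the deployed model corresponds to the peak of robust test accuracy rather than the final, overfit epoch.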