Training-related (Robustness certificates can be exploited to attack the models)
Knowledge of robustness certificates, including the size of the region within which model predictions are certified to be robust, can be used by an adversary to efficiently craft attacks that succeed just outside the certified regions [53].
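The exploit mechanic can be sketched as follows: an adversary who learns the certified L2 radius for an input knows that no perturbation of norm at most that radius can change the prediction, so a search budget can be concentrated just beyond the certified boundary. This is an illustrative sketch only; the `model` callable and the `probe_just_outside` helper are hypothetical names, not part of any cited attack implementation.

```python
import numpy as np

def probe_just_outside(x, certified_radius, model, n_trials=200, margin=0.05, rng=None):
    """Sketch of a certificate-guided probe (hypothetical helper).

    If the certified L2 radius r for input x is known, no perturbation of
    norm <= r can flip the label, so candidate attacks are sampled at norm
    r * (1 + margin), just outside the certified ball.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    base = model(x)  # label the adversary is trying to flip
    for _ in range(n_trials):
        d = rng.normal(size=x.shape)
        # Rescale the random direction to land just outside the certificate.
        d *= certified_radius * (1.0 + margin) / np.linalg.norm(d)
        if model(x + d) != base:
            return x + d  # successful attack just outside the certified region
    return None  # no flip found within the trial budget
```

The point of the sketch is the search geometry, not the sampler: real attacks would use gradient information rather than random directions, but the certificate still tells them exactly how far out to look.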
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1100
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement strict information governance: treat exact certified radii and the internal parameters of the certification mechanism as confidential, non-public information, depriving adversaries of the precise knowledge needed to efficiently craft attacks that succeed just outside the provably robust region.
2. Adopt advanced certified-defense methodologies, such as randomized smoothing or formal verification, to maximize the provable certified radius ($r^*$) and keep the certified bounds as tight as possible, minimizing the exploitable gap between the theoretical and empirical robustness boundaries.
3. Integrate run-time monitoring that couples the certified prediction with detection of inputs falling just outside the certified region, so the system can reject the prediction or engage a fallback mechanism (e.g., human review) when a high-confidence but uncertified prediction occurs close to the certified boundary.
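To make strategy 2 concrete, the randomized-smoothing certificate of Cohen et al. can be sketched as follows: sample noisy copies of the input, take the majority vote as the smoothed prediction, and certify an L2 radius $r^* = \sigma \cdot \Phi^{-1}(\underline{p_A})$ from a lower confidence bound on the top-class probability. This is a minimal sketch under stated assumptions: `model` is a hypothetical callable returning one label per input, and a normal-approximation confidence bound stands in for the Clopper-Pearson bound used in the published method.

```python
import math
from statistics import NormalDist

import numpy as np

def certify_smoothed(model, x, sigma=0.25, n=1000, alpha=0.001, rng=None):
    """Certify the smoothed prediction at x; return (class, certified L2 radius).

    `model` is a hypothetical callable mapping one input array to a label.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Vote over Gaussian-perturbed copies of the input.
    votes = {}
    for _ in range(n):
        label = model(x + rng.normal(0.0, sigma, size=x.shape))
        votes[label] = votes.get(label, 0) + 1
    top_class, count = max(votes.items(), key=lambda kv: kv[1])
    # Lower confidence bound on the top-class probability. The published
    # method uses a Clopper-Pearson bound; a normal approximation keeps
    # this sketch dependency-free.
    p_hat = count / n
    z = NormalDist().inv_cdf(1.0 - alpha)
    p_lower = p_hat - z * math.sqrt(max(p_hat * (1.0 - p_hat), 0.0) / n)
    p_lower = min(p_lower, 1.0 - 1e-6)  # keep inv_cdf in its open domain
    if p_lower <= 0.5:
        return None, 0.0  # abstain: no certificate at this confidence level
    # Certified L2 radius: r* = sigma * Phi^{-1}(p_lower).
    return top_class, sigma * NormalDist().inv_cdf(p_lower)
```

Strategy 3 then corresponds to a thin wrapper around this function: if the returned radius is small relative to the observed input's distance from recent decision flips, route the prediction to a fallback path instead of serving it directly.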