Adversarial input
Adversarial inputs involve modifying individual input data points to cause a model to malfunction. These modifications, often imperceptible to humans, exploit how the model makes decisions in order to produce errors (Wallace et al., 2019). They can be applied to text as well as to images, audio, or video (e.g., changing pixels in an image of a panda in a way that causes a model to label it as a gibbon).
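The mechanism can be sketched with the Fast Gradient Sign Method (FGSM): each input feature is nudged by a small amount in the direction that increases the model's loss. The toy logistic-regression model, weights, and epsilon below are illustrative assumptions, not values from the source; they only show how a bounded perturbation can flip a confident prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def fgsm_perturb(x, y, w, eps):
    """FGSM sketch for a logistic-regression 'model': the gradient of the
    cross-entropy loss with respect to the input x is (p - y) * w, so each
    feature is shifted by eps in the sign of that gradient."""
    p = sigmoid(dot(w, x))
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

# Hypothetical weights and clean input, chosen so x is confidently class 1.
w = [1.0, -2.0, 0.5]
x = [2.0, -1.0, 1.0]
x_adv = fgsm_perturb(x, y=1.0, w=w, eps=2.5)
print(round(sigmoid(dot(w, x)), 3))      # confidence on the clean input
print(round(sigmoid(dot(w, x_adv)), 3))  # confidence after perturbation
```

With these values the model's confidence in class 1 collapses after the perturbation, even though each feature moved by at most eps.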
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1263
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Adversarial Training
Fortify the model's inherent resilience by incorporating adversarially perturbed examples during the training phase, thereby exposing the model to diverse attack scenarios and enhancing its capacity to correctly classify manipulated inputs.
2. Input Validation and Sanitization
Implement robust input validation and filtering mechanisms as a primary defense layer, enforcing rigorous checks and sanitization on all incoming data and queries to identify and block subtle malicious perturbations before they are processed by the GenAI system.
3. Real-Time Monitoring and Anomaly Detection
Establish continuous, real-time monitoring of input and output streams, utilizing behavioral analytics and statistical baselining to immediately detect unusual query patterns or content irregularities that signify an ongoing or novel adversarial attack attempt.
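Strategy 3's statistical baselining can be sketched as a simple z-score monitor: fit a baseline from scores observed on benign traffic (e.g., an input's feature-magnitude or perplexity score), then flag inputs that deviate sharply from it. The class name, threshold, and baseline values below are illustrative assumptions, not a prescribed implementation.

```python
import statistics

class InputAnomalyMonitor:
    """Sketch of statistical baselining: learn the mean and standard
    deviation of a benign score distribution, then flag any incoming
    score whose z-score exceeds a threshold (3.0 here, an assumption)."""

    def __init__(self, baseline_scores, z_threshold=3.0):
        self.mean = statistics.mean(baseline_scores)
        self.std = statistics.pstdev(baseline_scores)
        self.z_threshold = z_threshold

    def is_anomalous(self, score):
        if self.std == 0:
            return score != self.mean
        return abs(score - self.mean) / self.std > self.z_threshold

# Hypothetical benign baseline scores collected during normal operation.
baseline = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98]
monitor = InputAnomalyMonitor(baseline)
print(monitor.is_anomalous(1.03))  # in-distribution input
print(monitor.is_anomalous(5.0))   # far outside the baseline
```

In practice the score function would come from the deployment's own behavioral analytics (query length, embedding distance, perplexity, request rate); the monitor itself only encodes the baselining logic.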