Adversarial input
Adversarial inputs involve modifying individual input data points to cause a model to malfunction. These modifications, often imperceptible to humans, exploit how the model makes decisions in order to produce errors (Wallace et al., 2019). They can be applied to text as well as to images, audio, or video (e.g., changing pixels in an image of a panda in a way that causes a model to label it as a gibbon).
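The mechanism can be sketched with the Fast Gradient Sign Method (FGSM): each input feature is nudged by a small amount in the direction that increases the model's loss. The toy logistic-regression model, weights, and epsilon below are illustrative assumptions, not values from the source; they only show how a bounded perturbation can flip a confident prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def fgsm_perturb(x, y, w, eps):
    """FGSM sketch for a logistic-regression 'model': the gradient of the
    cross-entropy loss with respect to the input x is (p - y) * w, so each
    feature is shifted by eps in the sign of that gradient."""
    p = sigmoid(dot(w, x))
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

# Hypothetical weights and clean input, chosen so x is confidently class 1.
w = [1.0, -2.0, 0.5]
x = [2.0, -1.0, 1.0]
x_adv = fgsm_perturb(x, y=1.0, w=w, eps=2.5)
print(round(sigmoid(dot(w, x)), 3))      # confidence on the clean input
print(round(sigmoid(dot(w, x_adv)), 3))  # confidence after perturbation
```

With these values the model's confidence in class 1 collapses after the perturbation, even though each feature moved by at most eps.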
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1263
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Adversarial Training
Fortify the model's inherent resilience by incorporating adversarially perturbed examples during the training phase, thereby exposing the model to diverse attack scenarios and enhancing its capacity to correctly classify manipulated inputs.
2. Input Validation and Sanitization
Implement robust input validation and filtering mechanisms as a primary defense layer, enforcing rigorous checks and sanitization on all incoming data and queries to identify and block subtle malicious perturbations before they are processed by the GenAI system.
3. Real-Time Monitoring and Anomaly Detection
Establish continuous, real-time monitoring of input and output streams, utilizing behavioral analytics and statistical baselining to immediately detect unusual query patterns or content irregularities that signify an ongoing or novel adversarial attack attempt.
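Strategy 3's statistical baselining can be sketched as a simple z-score monitor: fit a baseline from scores observed on benign traffic (e.g., an input's feature-magnitude or perplexity score), then flag inputs that deviate sharply from it. The class name, threshold, and baseline values below are illustrative assumptions, not a prescribed implementation.

```python
import statistics

class InputAnomalyMonitor:
    """Sketch of statistical baselining: learn the mean and standard
    deviation of a benign score distribution, then flag any incoming
    score whose z-score exceeds a threshold (3.0 here, an assumption)."""

    def __init__(self, baseline_scores, z_threshold=3.0):
        self.mean = statistics.mean(baseline_scores)
        self.std = statistics.pstdev(baseline_scores)
        self.z_threshold = z_threshold

    def is_anomalous(self, score):
        if self.std == 0:
            return score != self.mean
        return abs(score - self.mean) / self.std > self.z_threshold

# Hypothetical benign baseline scores collected during normal operation.
baseline = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98]
monitor = InputAnomalyMonitor(baseline)
print(monitor.is_anomalous(1.03))  # in-distribution input
print(monitor.is_anomalous(5.0))   # far outside the baseline
```

In practice the score function would come from the deployment's own behavioral analytics (query length, embedding distance, perplexity, request rate); the monitor itself only encodes the baselining logic.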