2. Privacy & Security | 2 - Post-deployment

Adversarial AI (General)

Adversarial AI refers to a class of attacks that exploit vulnerabilities in machine-learning (ML) models. These attacks target vulnerabilities introduced by the AI assistant itself, enabling malicious entities to exploit privacy weaknesses and evade the model's built-in safety mechanisms, policies, and ethical boundaries. Beyond the risk of misuse for offensive cyber operations, advanced AI assistants may also represent a new target for abuse, where bad actors exploit the AI systems themselves and use them to cause harm. While characterizing vulnerabilities in frontier AI models remains an open research problem, commercial firms and researchers have already documented attacks unique to AI, involving evasion, data poisoning, model replication, and the exploitation of traditional software flaws, that deceive, manipulate, compromise, and render AI systems ineffective. This threat is related to, but distinct from, traditional cyber activity: whereas traditional cyberattacks typically stem from 'bugs' or human mistakes in code, adversarial AI attacks exploit inherent vulnerabilities in the underlying AI algorithms and in how those algorithms integrate into existing software ecosystems.
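To make the evasion class of attack concrete, here is a minimal, hypothetical sketch of an FGSM-style evasion against a toy linear classifier. The weights, input, and perturbation budget are all illustrative assumptions, not drawn from any real system:

```python
import numpy as np

# Hypothetical linear classifier: score = w . x + b, label 1 if score > 0.
# Weights and the input below are illustrative, not from any real model.
w = np.array([2.0, -1.0, 0.5])
b = -0.5

def predict(x):
    return int(w @ x + b > 0)

x = np.array([1.0, 0.5, 0.2])      # classified as 1 ("flagged")

# FGSM-style evasion: step each feature against the gradient of the score
# with respect to x. For a linear model that gradient is simply w.
eps = 0.6
x_adv = x - eps * np.sign(w)       # small, bounded perturbation

print(predict(x), predict(x_adv))  # prints: 1 0 (the perturbation flips the decision)
```

The point of the sketch is that the attack requires no "bug": the perturbation is small and bounded, yet the decision flips because the model's decision boundary, not its code, is the vulnerability.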

Source: MIT AI Risk Repository, risk ID mit381

ENTITY

3 - Other

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit381

Domain lineage

2. Privacy & Security (186 mapped risks)

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. **Implement Adversarial Training and Robustness Testing.** Proactively fortify the model's resilience by incorporating adversarial examples into the training process and conducting regular robustness testing using red-team methodologies. This defense mechanism, which includes techniques such as Defensive Distillation, explicitly aims to harden the model's decision boundaries against evasion and poisoning attacks.

2. **Enforce Layered Input Validation and Sanitization.** Deploy rigorous input validation and preprocessing systems to detect and filter potentially malicious inputs, such as those used in prompt injection or evasion attacks, before they reach the model. This includes implementing statistical anomaly detection and input sanitization to normalize data and minimize the influence of irrelevant or misleading perturbations.

3. **Establish Continuous Behavioral Monitoring and Zero-Trust Architecture.** Institute real-time monitoring of AI system behavior to detect deviations from established performance baselines or suspicious decision patterns indicative of compromise. Furthermore, apply zero-trust principles to AI systems by limiting query rates and restricting output granularity to reduce information leakage and deter model extraction attacks.
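Strategy 1 can be sketched as follows: at each training step, the current model is attacked with FGSM-style perturbations, and the update is taken over the union of clean and perturbed examples. The toy data, epsilon, and learning rate are illustrative assumptions, not a prescribed configuration:

```python
import numpy as np

# Sketch of adversarial training on a toy logistic-regression "model".
# Data, epsilon, and learning rate are illustrative assumptions.
rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(loc=-1.0, size=(n, 2)),
               rng.normal(loc=+1.0, size=(n, 2))])
y = np.r_[np.zeros(n), np.ones(n)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0
eps, lr = 0.3, 0.1
for _ in range(300):
    # Craft FGSM perturbations against the current model: move each point
    # in the direction that increases its own loss (sign of d loss / d x).
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w          # logistic-loss input gradient
    X_adv = X + eps * np.sign(grad_x)
    # One gradient step on the union of clean and adversarial examples,
    # hardening the decision boundary against those same perturbations.
    Xa, ya = np.vstack([X, X_adv]), np.r_[y, y]
    pa = sigmoid(Xa @ w + b)
    w -= lr * Xa.T @ (pa - ya) / len(ya)
    b -= lr * np.mean(pa - ya)

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"clean accuracy after adversarial training: {acc:.2f}")
```

The same loop structure carries over to neural networks, where the input gradient comes from backpropagation rather than a closed form.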
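The statistical anomaly detection in strategy 2 can be illustrated with a simple per-feature z-score filter calibrated on traffic assumed to be clean. The calibration data, feature count, and threshold are hypothetical:

```python
import numpy as np

# Illustrative statistical input filter: flag inputs whose per-feature
# z-score against a clean calibration set exceeds a threshold, before
# they ever reach the model. All data and thresholds are assumptions.
rng = np.random.default_rng(1)
calib = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))  # assumed-clean traffic
mu, sigma = calib.mean(axis=0), calib.std(axis=0)

def is_anomalous(x, z_max=4.0):
    """Reject inputs with any feature more than z_max std devs from calibration."""
    z = np.abs((x - mu) / sigma)
    return bool(np.any(z > z_max))

print(is_anomalous(np.zeros(4)))                 # typical input -> False
print(is_anomalous(np.array([0., 0., 9., 0.])))  # gross perturbation -> True
```

A filter this simple only catches large, obvious perturbations; in practice it would be one layer among several, alongside sanitization and semantic checks on the input itself.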
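The zero-trust measures in strategy 3, query-rate limiting and output-granularity restriction, can be sketched as a small wrapper around the model endpoint. The class name, limits, and rounding precision below are assumptions for illustration:

```python
import time

class QueryGuard:
    """Illustrative zero-trust wrapper: caps per-client query rate and
    coarsens model outputs to reduce extraction-attack leakage.
    Names and limits are assumptions, not from any real API."""

    def __init__(self, max_queries, window_s):
        self.max_queries = max_queries
        self.window_s = window_s
        self.log = {}  # client_id -> list of request timestamps

    def allow(self, client_id, now=None):
        """Sliding-window rate limit: True if this query may proceed."""
        now = time.monotonic() if now is None else now
        hits = [t for t in self.log.get(client_id, []) if now - t < self.window_s]
        self.log[client_id] = hits
        if len(hits) >= self.max_queries:
            return False
        hits.append(now)
        return True

    @staticmethod
    def coarsen(probs, decimals=1):
        # Return only rounded scores: fine-grained confidences are a key
        # signal exploited by model-extraction attacks.
        return [round(p, decimals) for p in probs]

guard = QueryGuard(max_queries=3, window_s=60)
ok = [guard.allow("attacker", now=i) for i in range(5)]
print(ok)                               # [True, True, True, False, False]
print(guard.coarsen([0.8731, 0.1269]))  # [0.9, 0.1]
```

Throttling alone does not stop extraction, it only raises its cost; combining it with coarsened outputs and per-client monitoring is what makes the zero-trust posture effective.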