2. Privacy & Security
2 - Post-deployment

Adversarial AI: Circumvention of Technical Security Measures

The technical measures designed to mitigate misuse of advanced AI assistants are themselves a new target for attack. An emerging form of misuse of general-purpose advanced AI assistants exploits vulnerabilities in a model that result in unwanted behavior, or that allow an attacker to gain unauthorized access to the model and/or its capabilities. While these attacks currently require some prompt engineering knowledge and are often patched by developers, bad actors may build their own adversarial AI agents explicitly trained to discover new vulnerabilities that evade the built-in safety mechanisms of AI assistants. To combat such misuse, language model developers are engaged in an ongoing cyber arms race, devising ever more advanced filtering algorithms to identify attempts to bypass safety filters. The impact and severity of this class of attacks is still somewhat limited by the fact that current AI assistants are primarily text-based chatbots; however, advanced AI assistants are likely to open the door to multimodal inputs and higher-stakes action spaces, so the severity and impact of these attacks is likely to increase. Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in advanced AI assistant development could yield capabilities that pose extreme risks, such as offensive cyber operations, strong manipulation skills, or weapons acquisition, and systems with such capabilities must be protected against this class of attacks.
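The filter-evasion dynamic described above can be made concrete with a small sketch. A naive keyword-based safety filter matches blocked phrases literally, so trivial obfuscation (e.g. Unicode fullwidth characters or case changes) slips past it; normalizing input before matching closes some of these gaps. The blocklist phrases and function names below are illustrative assumptions, not part of any real assistant's filtering stack:

```python
import re
import unicodedata

# Illustrative blocklist of jailbreak-style phrases (assumption, not a real product list).
BLOCKLIST = {
    "ignore previous instructions",
    "disregard your system prompt",
}

def normalize(text: str) -> str:
    """Fold Unicode lookalikes to ASCII (NFKC), lowercase, and
    replace punctuation with spaces so phrase matching is harder to evade."""
    folded = unicodedata.normalize("NFKC", text).lower()
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", folded)
    return re.sub(r"\s+", " ", cleaned).strip()

def flags_prompt(text: str) -> bool:
    """Return True if the normalized input contains a blocked phrase."""
    cleaned = normalize(text)
    return any(phrase in cleaned for phrase in BLOCKLIST)
```

A literal filter would miss `"ＩＧＮＯＲＥ previous instructions"` (fullwidth letters), while the normalized check catches it; the arms race arises because attackers then move to perturbations that survive normalization, forcing defenders to add further layers.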

Source: MIT AI Risk Repository (mit382)

ENTITY

3 - Other

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit382

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement Robust Model Hardening via Adversarial Training: Fortify the foundational model by employing adversarial training, which augments the training dataset with adversarial examples to enhance the model's intrinsic robustness. This technique improves resilience against evasion attacks, such as prompt injection and jailbreaking, by hardening the model's decision-making logic against subtle input perturbations.

2. Establish Continuous, Systematic AI Red Teaming: Institute an ongoing security lifecycle that includes systematic AI red teaming exercises. This proactive measure simulates sophisticated adversarial attacks (e.g., using agentic frameworks) to systematically discover and document model- and system-level vulnerabilities, ensuring continuous alignment with safety constraints and enabling timely patching of new circumvention vectors.

3. Deploy Layered Defense with Real-time Input Sanitization and Access Control: Utilize an application-layer defense strategy that incorporates real-time input preprocessing, sanitization, and context-aware filtering to detect and block malicious or obfuscated prompts. Concurrently, enforce granular access controls, such as Role-Based Access Control (RBAC), to strictly limit user access and permissions to the minimum necessary capabilities, thereby mitigating the risk of unauthorized command execution or privilege escalation via model exploitation.
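The access-control layer in strategy 3 can be sketched as a permission gate that sits between the model and its tools: before any model-requested action runs, the gate checks whether the caller's role permits that tool. This is a minimal illustration assuming a hypothetical role-to-tool mapping; the role and tool names are invented for the example:

```python
# Hypothetical role -> permitted-tool mapping (illustrative only).
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "viewer": {"search_docs"},
    "analyst": {"search_docs", "run_query"},
    "admin": {"search_docs", "run_query", "delete_records"},
}

def execute_tool_call(role: str, tool: str) -> str:
    """Gate a model-requested tool call behind RBAC.

    Unknown roles get an empty permission set, so denial is the default
    (least privilege): a jailbroken model cannot escalate beyond what the
    authenticated user's role already allows.
    """
    allowed = ROLE_PERMISSIONS.get(role, set())
    if tool not in allowed:
        raise PermissionError(f"role {role!r} may not call tool {tool!r}")
    return f"executed {tool}"
```

The key design point is that the check keys off the authenticated user's role, not anything the model says, so even a fully circumvented model cannot trigger actions beyond that user's permissions.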