4. Malicious Actors & Misuse | 2 - Post-deployment

High-impact misuses and abuses beyond original purpose

Because general-purpose AI systems have a large repertoire of capabilities, malicious actors, such as foreign actors, can use these systems to cause large-scale damage if they gain unrestricted or unmonitored access to them.

Source: MIT AI Risk Repository (mit1166)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1166

Domain lineage

4. Malicious Actors & Misuse

223 mapped risks

4.0 > Malicious use

Mitigation strategy

1. Implement **Strategic Friction and Capability-Gating Protocols**. Deploy multi-layered technical constraints, such as granular **rate limits** and **context-aware refusal mechanisms**, to inhibit the rapid, large-scale generation of malicious content. Furthermore, institute a **progressive trust model** that restricts access to the most high-impact, generalized capabilities until a user's identity and benign intent have been technically verified.

2. Enforce **Robust Safety Alignment and Filtering** through the AI Development Lifecycle. Apply advanced **fine-tuning** techniques, such as Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF), to condition the model to actively and reliably reject requests for dangerous or prohibited outputs. Supplement this with **input/output filtering** via classifiers to detect and block malicious prompts or synthetically generated harmful content before it is processed or delivered.

3. Establish **Continuous Governance and Auditable Access Controls**. Develop an overarching **AI Governance framework** that mandates continuous, real-time monitoring of system usage for anomalous activity indicative of malicious exploitation (e.g., prompt injection attempts or high-volume query patterns in sensitive domains). This framework must also enforce **stringent access management** over the general-purpose AI system's deployment environment and application programming interfaces (APIs) to prevent unauthorized or unmonitored access by external or insider threat actors.
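As a rough illustration of how rate limits, a progressive trust model, and auditable access decisions could fit together, the following Python sketch combines a token-bucket rate limiter with tier-based capability gating and an in-memory audit log. It is a minimal sketch, not a prescribed implementation: the tier names, per-minute limits, capability names, and the `AccessGate` and `TokenBucket` classes are all hypothetical and do not come from the repository entry.

```python
import time
from dataclasses import dataclass, field

# Hypothetical trust tiers, ordered from least to most trusted.
TIER_ORDER = ["anonymous", "verified", "vetted"]
# Illustrative requests-per-minute limits per tier (assumed values).
TIER_LIMITS = {"anonymous": 5, "verified": 60, "vetted": 600}
# Hypothetical high-impact capabilities and the minimum tier each requires.
GATED_CAPABILITIES = {"code_execution": "verified", "bulk_generation": "vetted"}


@dataclass
class TokenBucket:
    """Classic token bucket: holds up to `capacity` tokens, refilled over time."""
    capacity: float
    refill_per_sec: float
    last: float  # timestamp of the most recent refill
    tokens: float = field(init=False)

    def __post_init__(self):
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AccessGate:
    """Applies tier gating first, then a per-user rate limit, and logs every decision."""

    def __init__(self):
        self.buckets = {}    # (user_id, tier) -> TokenBucket
        self.audit_log = []  # (timestamp, user_id, tier, capability, decision)

    def check(self, user_id, tier, capability, now=None):
        now = time.monotonic() if now is None else now
        # Capabilities not listed as gated are open to every tier.
        required = GATED_CAPABILITIES.get(capability, "anonymous")
        if TIER_ORDER.index(tier) < TIER_ORDER.index(required):
            decision = "denied_tier"
        else:
            key = (user_id, tier)
            if key not in self.buckets:
                limit = TIER_LIMITS[tier]
                self.buckets[key] = TokenBucket(limit, limit / 60.0, now)
            decision = "allowed" if self.buckets[key].allow(now) else "denied_rate"
        # Every decision is recorded so that usage can be reviewed later.
        self.audit_log.append((now, user_id, tier, capability, decision))
        return decision
```

Calling `AccessGate().check("user-1", "anonymous", "bulk_generation")` would return `"denied_tier"` under these assumed settings, since that capability is reserved for the highest tier. A real deployment would likely persist the audit log, feed it into anomaly detection, and tie tiers to an actual identity-verification process, none of which this sketch attempts.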