4. Malicious Actors & Misuse

Malicious Uses

Harms that arise from actors using the language model to intentionally cause harm

Source: MIT AI Risk Repository (mit244)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit244

Domain lineage

4. Malicious Actors & Misuse

223 mapped risks

4.0 > Malicious use

Mitigation strategy

1. Implement rigorous Input Validation and Sanitization to act as the first line of defense against prompt injection and data-driven attacks. This includes enforcing strict schemas, applying rate limits to prevent resource abuse, and ensuring that user-supplied data is explicitly segregated from system instructions.

2. Mandate comprehensive Output Moderation and Validation, treating all model-generated content, code, or commands as untrusted data. Systems must scan outputs for policy violations, sensitive data, or malicious payloads before delivery, and execute generated code only within a least-privilege, sandboxed environment.

3. Conduct continuous, systematic Adversarial Testing and Red Teaming exercises to proactively evaluate model robustness against malicious intent. This process is critical for identifying and mitigating vulnerabilities, such as circumvention of safety features or unintended disclosures, before they can be exploited in a real-world scenario.

ADDITIONAL EVIDENCE

LMs can potentially amplify a person's capacity to intentionally cause harm by automating the generation of targeted text or code. For example, LMs may lower the cost of disinformation campaigns, where disinformation is false information created with the intent to mislead, in contrast to misinformation, which is false but lacks explicit intent to mislead. LMs may also be used to carry out more targeted manipulation of individuals or groups.