Malicious Uses
Harms that arise from actors using the language model to intentionally cause harm
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit244
Domain lineage
4. Malicious Actors & Misuse
4.0 > Malicious use
Mitigation strategy
1. Implement rigorous Input Validation and Sanitization to act as the first line of defense against prompt injection and data-driven attacks. This includes enforcing strict schemas, applying rate limits to prevent resource abuse, and ensuring that user-supplied data is explicitly segregated from system instructions.
2. Mandate comprehensive Output Moderation and Validation, treating all model-generated content, code, or commands as untrusted data. Systems must scan outputs for policy violations, sensitive data, or malicious payloads before delivery, and execute generated code only within a least-privilege, sandboxed environment.
3. Conduct continuous, systematic Adversarial Testing and Red Teaming exercises to proactively evaluate model robustness against malicious intent. This process is critical for identifying and mitigating vulnerabilities, such as circumvention of safety features or unintended disclosures, before they can be exploited in a real-world scenario.
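The first two strategies above can be sketched in code. The following is a minimal illustration, not a production defense: the length limit, the blocked-output patterns, and all function names are assumptions chosen for the example. It shows the three core moves: enforcing a simple schema on user input, segregating user data from system instructions via structured message roles rather than string concatenation, and scanning model output as untrusted data before delivery.

```python
import re

# Hypothetical policy limits -- illustrative values, not from the source.
MAX_INPUT_CHARS = 4000
BLOCKED_OUTPUT_PATTERNS = [
    re.compile(r"(?i)\b(?:api[_-]?key|password)\s*[:=]"),  # possible credential leak
    re.compile(r"rm\s+-rf\s+/"),                           # destructive shell command
]

def validate_input(user_text: str) -> str:
    """Enforce a strict schema on user input before it reaches the model."""
    if not isinstance(user_text, str):
        raise TypeError("user input must be a string")
    if len(user_text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds length limit")
    # Strip non-printable control characters that could smuggle hidden instructions.
    return "".join(ch for ch in user_text if ch.isprintable() or ch in "\n\t")

def build_prompt(system_instructions: str, user_text: str) -> list[dict]:
    """Segregate user-supplied data from system instructions using structured
    roles, instead of concatenating both into a single prompt string."""
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": validate_input(user_text)},
    ]

def moderate_output(model_output: str) -> str:
    """Treat model output as untrusted: scan for policy-violating payloads
    before delivering it downstream."""
    for pattern in BLOCKED_OUTPUT_PATTERNS:
        if pattern.search(model_output):
            raise PermissionError("output blocked by moderation policy")
    return model_output
```

In a real deployment the regex list would be replaced by a dedicated moderation model or service, and rate limiting (mentioned in strategy 1) would sit in front of `build_prompt`; the structural point is that validation and moderation are separate checkpoints on either side of the model.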
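Strategy 2 also calls for executing generated code only in a least-privilege, sandboxed environment. The sketch below illustrates the idea only: it runs model-generated Python in a separate interpreter process with a wall-clock timeout and an emptied environment. A real sandbox needs OS-level isolation (containers, seccomp filters, or similar); the function name and timeout value here are assumptions for the example.

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: float = 5.0) -> str:
    """Execute model-generated Python in a child process with a timeout and a
    stripped environment. Illustration of least-privilege execution only --
    NOT a real sandbox (no filesystem or network isolation)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores env vars and user site-packages
        capture_output=True,
        text=True,
        timeout=timeout_s,  # kill runaway generated code
        env={},             # do not inherit secrets from the parent environment
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

The design choice worth noting is that the generated code never runs inside the serving process: even this weak isolation prevents it from reading the host's environment variables or crashing the parent on an exception.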
ADDITIONAL EVIDENCE
LMs can potentially amplify a person's capacity to intentionally cause harm by automating the generation of targeted text or code. For example, LMs may lower the cost of disinformation campaigns. (Disinformation is false information created with the intent to mislead, in contrast to misinformation, which is false but carries no explicit intent to mislead.) LMs may also enable more targeted manipulation of individuals or groups.