4. Malicious Actors & Misuse (2 - Post-deployment)

Harmful Content Generation at Scale (General)

While harmful content such as child sexual abuse material, fraud, and disinformation is not a new challenge for governments and developers, advanced AI assistants without proper safety and security mechanisms may allow threat actors to create harmful content more quickly, more accurately, and with longer reach. In particular, concerns arise in the following areas:

- Multimodal content quality: Driven by frontier models, advanced AI assistants can automatically generate much higher-quality, human-looking text, images, audio, and video than prior AI applications. Creating such content currently often requires hiring people who speak the language of the targeted population; AI assistants can now do this much more cheaply and efficiently.
- Cost of content creation: AI assistants can substantially decrease the costs of content creation, further lowering the barrier to entry for malicious actors to carry out harmful attacks. In the past, creating and disseminating misinformation required a significant investment of time and money.
- Personalization: Advanced AI assistants reduce obstacles to creating personalized content. Foundation models that condition their generations on personal attributes or information can create realistic personalized content that may be more persuasive; producing such content was previously a time-consuming and expensive process.

Source: MIT AI Risk Repository (mit385)

ENTITY: 1 - Human

INTENT: 1 - Intentional

TIMING: 2 - Post-deployment

Risk ID: mit385

Domain lineage: 4. Malicious Actors & Misuse > 4.1 Disinformation, surveillance, and influence at scale (223 mapped risks)

Mitigation strategy

1. Implement advanced behavioral alignment mitigations, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, during model fine-tuning to increase refusal rates for inappropriate requests and to resist adversarial inputs (jailbreaking) aimed at generating harmful content, particularly in high-severity areas such as child sexual abuse material (CSAM) and fraud.
2. Deploy a multi-layered, real-time output moderation pipeline that uses multimodal safety classifiers to scan for and block harmful synthetic text, images, and audio. Couple this detection strategy with digital content transparency techniques (e.g., digital watermarking and C2PA standards) to reliably label AI-generated content and establish provenance, thereby mitigating disinformation and the erosion of trust.
3. Establish a continuous, systematic adversarial testing and defense program (red-teaming) to proactively simulate threat scenarios and identify novel vulnerabilities that could facilitate harmful content generation at scale. Support this with a robust vulnerability management process and a transparent, rapid notice-and-takedown framework for confirmed violations.
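To make the second mitigation concrete, the following is a minimal sketch of a layered output-moderation gate. Everything here is illustrative: the category names, thresholds, and the `safety_scores` input are hypothetical stand-ins for the outputs of trained multimodal safety classifiers, and the provenance label stands in for a real watermarking or C2PA-style manifest service.

```python
# Illustrative sketch of a layered output-moderation gate.
# Classifier scores and thresholds are hypothetical; a real deployment
# would call trained multimodal safety classifiers and a provenance
# service (e.g. digital watermarking or C2PA manifests).

from dataclasses import dataclass, field


@dataclass
class ModerationResult:
    allowed: bool
    labels: list = field(default_factory=list)  # block reasons or provenance labels


# Hypothetical per-category limits: block when any score exceeds its limit.
# High-severity categories get much stricter thresholds.
THRESHOLDS = {"csam": 0.01, "fraud": 0.5, "disinformation": 0.7}


def moderate(content: str, safety_scores: dict) -> ModerationResult:
    """Gate one piece of generated content through stacked checks."""
    # Layer 1: hard block if any safety-classifier score is over its limit.
    for category, limit in THRESHOLDS.items():
        if safety_scores.get(category, 0.0) > limit:
            return ModerationResult(allowed=False, labels=[f"blocked:{category}"])
    # Layer 2: content passed, so attach a provenance label before release,
    # letting downstream consumers identify it as AI-generated.
    return ModerationResult(allowed=True, labels=["ai-generated"])
```

The point of the sketch is the ordering: detection layers run first and can short-circuit with a block, while transparency labeling is applied unconditionally to everything that ships, so labeling does not depend on any single classifier catching the content.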