4. Malicious Actors & Misuse (Pre-deployment)

Acquisition of a goal to harm society

Cases of AI systems being given the outright goal of harming humanity (e.g., ChaosGPT).

Source: MIT AI Risk Repository (mit859)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit859

Domain lineage

4. Malicious Actors & Misuse

223 mapped risks

4.2 > Cyberattacks, weapon development or use, and mass harm

Mitigation strategy

1. Implement comprehensive value alignment and control mechanisms: Employ formal methods, adversarial training, and interpretability tools during pre-deployment to rigorously verify that the AI's intrinsic goal system is aligned with human values and demonstrably resistant to instrumental goal convergence and deceptive alignment behaviors.

2. Enforce strict misuse prevention and safety filtering: Use safety-centric fine-tuning (e.g., RLHF/RLAIF) and deploy robust input/output filters to preemptively block and refuse the generation of content, or the execution of commands, that would facilitate malicious acts such as cyberattacks, weapon development, or mass harm when requested by a malicious actor.

3. Establish restrictive access and deployment governance: Mandate a structured governance framework, including centralized access licensing, pre-publication risk assessments, and secure hardware enclaves, so that high-capability, dual-use AI systems are developed and deployed only by entities that meet and maintain stringent security and safety compliance standards.
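The input/output filtering described in the second strategy can be illustrated with a minimal sketch. Everything below is hypothetical and not from the repository entry: the `BLOCKED_PATTERNS` deny-list, the `screen_input`/`screen_output` function names, and the refusal message are illustrative assumptions. Production systems rely on trained safety classifiers, not keyword matching.

```python
import re

# Hypothetical, illustrative deny-list. Real deployments use trained
# safety classifiers rather than regex keyword matching.
BLOCKED_PATTERNS = [
    r"\bbuild (a |an )?(bomb|bioweapon)\b",
    r"\bwrite (ransomware|malware)\b",
]
REFUSAL = "Request refused: the prompt matches a misuse policy."


def screen_input(prompt: str) -> tuple[bool, str]:
    """Pre-generation check: return (allowed, message) for a user prompt."""
    lowered = prompt.lower()
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, lowered):
            return False, REFUSAL
    return True, prompt


def screen_output(completion: str) -> str:
    """Post-generation check: replace completions that match misuse patterns."""
    lowered = completion.lower()
    if any(re.search(p, lowered) for p in BLOCKED_PATTERNS):
        return REFUSAL
    return completion
```

Screening both the input and the output matters because a harmful completion can emerge even from a prompt that passed the input filter.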