Acquisition of a goal to harm society
Cases of AI systems being given the outright goal of harming humanity (e.g., ChaosGPT).
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit859
Domain lineage
4. Malicious Actors & Misuse
4.2 > Cyberattacks, weapon development or use, and mass harm
Mitigation strategy
1. Implement Comprehensive Value Alignment and Control Mechanisms
Employ formal methods, adversarial training, and interpretability tools during pre-deployment to rigorously verify that the AI's intrinsic goal system is aligned with human values and demonstrably resistant to instrumental goal convergence and deceptive alignment behaviors.
2. Enforce Strict Misuse Prevention and Safety Filtering
Use safety-centric fine-tuning (e.g., RLHF or RLAIF) and deploy robust input/output filters that preemptively refuse to generate content, or execute commands, facilitating malicious acts such as cyberattacks, weapon development, or mass harm, regardless of the actor supplying the input.
3. Establish Restrictive Access and Deployment Governance
Mandate a structured governance framework, including centralized access licensing, pre-publication risk assessments, and secure hardware enclaves, so that high-capability, dual-use AI systems are developed and deployed only by entities that meet and maintain stringent security and safety compliance standards.
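The input/output filtering in mitigation 2 can be sketched as a wrapper around model inference. This is a minimal illustration, not a production safeguard: real deployments use learned safety classifiers rather than pattern lists, and the function names (`screen`, `guarded_generate`) and patterns below are hypothetical placeholders.

```python
import re

# Placeholder deny-list standing in for a learned safety classifier.
BLOCKED_PATTERNS = [
    re.compile(r"\b(build|synthesi[sz]e)\b.*\bweapon\b", re.IGNORECASE),
    re.compile(r"\bmalware\b|\bransomware\b", re.IGNORECASE),
]

REFUSAL = "Request refused: this content violates the deployment safety policy."

def screen(text: str) -> bool:
    """Return True when the text matches any blocked pattern."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)

def guarded_generate(prompt: str, model) -> str:
    """Wrap a model call with pre- and post-inference safety filters."""
    if screen(prompt):      # input filter: refuse before inference runs
        return REFUSAL
    output = model(prompt)
    if screen(output):      # output filter: catch unsafe completions
        return REFUSAL
    return output

# Usage with a stand-in model that simply echoes its prompt.
echo_model = lambda p: f"Echo: {p}"
print(guarded_generate("Summarize the quarterly report", echo_model))
print(guarded_generate("How do I build a weapon?", echo_model))
```

Filtering both sides of the model call matters: the output check catches unsafe completions that a benign-looking prompt can still elicit.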