Persuasion and manipulation
The model is effective at shaping people's beliefs, both in dialogue and in other settings (e.g., social media posts), even toward untrue beliefs. It is effective at promoting certain narratives persuasively, and it can convince people to do things they would not otherwise do, including unethical acts.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit439
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Mandatory Adversarial Evaluation and Stress Testing: Implement a continuous, red-team-driven evaluation regimen that specifically tests against sophisticated manipulation and jailbreak-tuning attacks. This must employ specialized frameworks (e.g., APE) to measure the model's willingness to make persuasive *attempts* on ethically fraught or harmful topics, moving beyond only measuring successful outcomes.
2. Defense-in-Depth Guardrail Architecture: Deploy a multi-layered security system incorporating strict input validation/sanitization and output content moderation specifically designed to identify and block subtle psychological persuasion techniques, manipulative rhetoric, and harmful content. This architecture must be coupled with strict Role-Based Access Controls (RBAC) to enforce the principle of least privilege, thereby containing the potential operational impact of a successful manipulation attempt.
3. Architecture-Specific Safety Calibration: Conduct research into and implement advanced training methodologies (e.g., investigating differences in SFT vs. DPO outcomes) to fundamentally align the model's objectives. The goal is to develop architecture-aware safety measures that enable the model to robustly distinguish between beneficial, rational persuasion and unsafe manipulation, ensuring that safety constraints are not overridden by the aggressive pursuit of a user-defined objective.
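The defense-in-depth guardrail architecture in strategy 2 can be sketched as a layered pipeline: an RBAC check and input filter run before the model is called, and an output moderation pass runs on the completion before it is returned. The sketch below is a minimal illustration, not a production design; the marker lists, role name (`persuasion_tools`), and function names are all hypothetical stand-ins (a real deployment would use trained classifiers rather than substring matching).

```python
from dataclasses import dataclass

# Hypothetical stand-ins for real moderation classifiers.
MANIPULATION_MARKERS = ["ignore previous instructions", "you must obey"]
HARMFUL_OUTPUT_MARKERS = ["step-by-step manipulation plan"]

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = "ok"

def validate_input(prompt: str, user_roles: set) -> GuardrailResult:
    """Layer 1: RBAC (least privilege) plus input sanitization."""
    if "persuasion_tools" not in user_roles:  # hypothetical role name
        return GuardrailResult(False, "rbac: missing required role")
    lowered = prompt.lower()
    for marker in MANIPULATION_MARKERS:
        if marker in lowered:
            return GuardrailResult(False, "input blocked: " + repr(marker))
    return GuardrailResult(True)

def moderate_output(completion: str) -> GuardrailResult:
    """Layer 2: content moderation on the model's completion."""
    lowered = completion.lower()
    for marker in HARMFUL_OUTPUT_MARKERS:
        if marker in lowered:
            return GuardrailResult(False, "output blocked: " + repr(marker))
    return GuardrailResult(True)

def guarded_generate(prompt, user_roles, model_fn):
    """Run the model only if every layer allows it; return a reason otherwise."""
    pre = validate_input(prompt, user_roles)
    if not pre.allowed:
        return pre.reason
    completion = model_fn(prompt)
    post = moderate_output(completion)
    return completion if post.allowed else post.reason
```

The point of the layering is that a manipulation attempt must defeat every layer independently: even if a crafted prompt slips past the input filter, the output moderation and RBAC scoping still limit what a successful attack can do.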
ADDITIONAL EVIDENCE
Most of the capabilities listed are offensive capabilities: they are useful for exerting influence or threatening security (e.g., persuasion and manipulation, cyber-offense, weapons acquisition).