Persuasion and manipulation
Exploiting user trust, or nudging or coercing users into performing actions against their will (cf. Burtell and Woodside (2023); Kenton et al. (2021))
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit276
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish a systematic **Manipulation Risk Control Framework** with three core components: real-time **Detection protocols** that monitor user-AI interactions for anomalous or manipulative influence attempts; **Disruption protocols** that automatically intervene and block flagged communications; and **Fortification protocols** that strengthen both technical and human defenses against recognized manipulation vectors.
2. Enforce **Autonomy-Preserving Design Principles** throughout the AI lifecycle, prioritizing user agency and informed consent over optimization metrics. Provide users with clear mechanisms to opt out, revoke consent, and customize the level of influence, and ensure full transparency about the purpose and function of persuasive elements (e.g., nudges).
3. Conduct rigorous **Adversarial Red Teaming and Stress Testing** that specifically targets manipulation vulnerabilities. These simulations should proactively challenge model robustness by attempting to subvert safety guardrails with advanced psychological persuasion techniques, so that emergent manipulative capabilities are identified and mitigated before deployment.
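The Detection and Disruption components of the framework above can be sketched as a simple pipeline. This is an illustrative toy only: `manipulation_score`, the keyword set, and the threshold are all hypothetical stand-ins for a real-time classifier and calibrated policy, not part of any specific framework named in this entry.

```python
# Hypothetical sketch of a Detection -> Disruption flow.
# A real system would use a trained classifier, not keyword matching.
from dataclasses import dataclass

# Illustrative stand-in for learned manipulation indicators (all lowercase).
MANIPULATION_KEYWORDS = {"you must", "don't tell anyone", "only i can help you"}


@dataclass
class Verdict:
    blocked: bool
    reason: str


def manipulation_score(message: str) -> float:
    """Toy detector: fraction of known coercive phrases present in the message."""
    text = message.lower()
    hits = sum(1 for phrase in MANIPULATION_KEYWORDS if phrase in text)
    return hits / len(MANIPULATION_KEYWORDS)


def disrupt(message: str, threshold: float = 0.3) -> Verdict:
    """Disruption protocol: block any message whose detection score crosses the threshold."""
    score = manipulation_score(message)
    if score >= threshold:
        return Verdict(blocked=True, reason=f"manipulation score {score:.2f} >= {threshold}")
    return Verdict(blocked=False, reason="passed detection")


if __name__ == "__main__":
    print(disrupt("Here is the weather forecast.").blocked)
    print(disrupt("Don't tell anyone, only i can help you.").blocked)
```

A production Fortification step would then feed blocked examples back into detector training and user-facing transparency reports, closing the loop the framework describes.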
ADDITIONAL EVIDENCE
Example: A personalised AI assistant persuading someone to harm themselves (Xiang, 2023)