Persuasion and manipulation
Exploiting user trust, or nudging or coercing users into performing actions against their will (cf. Burtell and Woodside (2023); Kenton et al. (2021))
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit276
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish a systematic **Manipulation Risk Control Framework** with three core components: real-time **Detection protocols** that monitor user-AI interactions for anomalous or manipulative influence attempts; **Disruption protocols** that automatically intervene and block flagged communications; and **Fortification protocols** that strengthen both technical and human defenses against recognized manipulation vectors.
2. Enforce **Autonomy-Preserving Design Principles** throughout the AI lifecycle, prioritizing user agency and informed consent over optimization metrics. Provide users with clear mechanisms to opt out, revoke consent, and customize the level of influence, and ensure full transparency about the purpose and function of persuasive elements (e.g., nudges).
3. Conduct rigorous **Adversarial Red Teaming and Stress Testing** that specifically targets manipulation vulnerabilities. These simulations should proactively challenge model robustness by attempting to subvert safety guardrails with advanced psychological persuasion techniques, so that emergent manipulative capabilities are identified and mitigated before deployment.
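The Detection and Disruption components of the framework above can be sketched as a simple pipeline. This is an illustrative toy only: `manipulation_score`, the keyword set, and the threshold are all hypothetical stand-ins for a real-time classifier and calibrated policy, not part of any specific framework named in this entry.

```python
# Hypothetical sketch of a Detection -> Disruption flow.
# A real system would use a trained classifier, not keyword matching.
from dataclasses import dataclass

# Illustrative stand-in for learned manipulation indicators (all lowercase).
MANIPULATION_KEYWORDS = {"you must", "don't tell anyone", "only i can help you"}


@dataclass
class Verdict:
    blocked: bool
    reason: str


def manipulation_score(message: str) -> float:
    """Toy detector: fraction of known coercive phrases present in the message."""
    text = message.lower()
    hits = sum(1 for phrase in MANIPULATION_KEYWORDS if phrase in text)
    return hits / len(MANIPULATION_KEYWORDS)


def disrupt(message: str, threshold: float = 0.3) -> Verdict:
    """Disruption protocol: block any message whose detection score crosses the threshold."""
    score = manipulation_score(message)
    if score >= threshold:
        return Verdict(blocked=True, reason=f"manipulation score {score:.2f} >= {threshold}")
    return Verdict(blocked=False, reason="passed detection")


if __name__ == "__main__":
    print(disrupt("Here is the weather forecast.").blocked)
    print(disrupt("Don't tell anyone, only i can help you.").blocked)
```

A production Fortification step would then feed blocked examples back into detector training and user-facing transparency reports, closing the loop the framework describes.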
ADDITIONAL EVIDENCE
Example: A personalised AI assistant persuading someone to harm themselves (Xiang, 2023)