7. AI System Safety, Failures, & Limitations

Persuasion and manipulation

The model is effective at shaping people’s beliefs, in dialogue and other settings (e.g. social media posts), even towards untrue beliefs. The model is effective at promoting certain narratives in a persuasive way. It can convince people to do things that they would not otherwise do, including unethical acts.

Source: MIT AI Risk Repository (risk ID mit439)

ENTITY: 2 - AI

INTENT: 1 - Intentional

TIMING: 2 - Post-deployment

Risk ID: mit439

Domain lineage: 7. AI System Safety, Failures, & Limitations > 7.2 AI possessing dangerous capabilities (375 mapped risks)

Mitigation strategy

1. Mandatory Adversarial Evaluation and Stress Testing: Implement a continuous, red-team-driven evaluation regimen that specifically tests against sophisticated manipulation and jailbreak-tuning attacks. Employ specialized frameworks (e.g., APE) to measure the model's willingness to make persuasive *attempts* on ethically fraught or harmful topics, not just its rate of successful persuasion.

2. Defense-in-Depth Guardrail Architecture: Deploy a multi-layered security system combining strict input validation/sanitization with output content moderation designed to identify and block subtle psychological persuasion techniques, manipulative rhetoric, and harmful content. Couple this architecture with strict Role-Based Access Controls (RBAC) enforcing the principle of least privilege, so that the operational impact of a successful manipulation attempt is contained.

3. Architecture-Specific Safety Calibration: Research and implement advanced training methodologies (e.g., investigating differences in SFT vs. DPO outcomes) to align the model's objectives. The goal is architecture-aware safety measures that let the model robustly distinguish beneficial, rational persuasion from unsafe manipulation, so that safety constraints are not overridden by aggressive pursuit of a user-defined objective.
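Strategy 1 distinguishes persuasive *attempts* from persuasive *successes*. A minimal sketch of that bookkeeping is below; the `classify_attempt` and `classify_success` judges are hypothetical stand-ins (in practice these would be human raters or a judge model), stubbed here with keyword checks purely to make the example runnable.

```python
# Sketch: attempt-rate vs. success-rate scoring for a red-team
# persuasion evaluation. The classifiers are illustrative stubs,
# not a real APE implementation.

def classify_attempt(response: str) -> bool:
    # Did the model *try* to persuade, regardless of outcome?
    return "you should" in response.lower()

def classify_success(response: str, follow_up: str) -> bool:
    # Did the target actually report a changed belief?
    return "i agree" in follow_up.lower()

def score_eval(transcripts):
    n = len(transcripts)
    attempts = sum(classify_attempt(t["response"]) for t in transcripts)
    successes = sum(
        classify_success(t["response"], t["follow_up"]) for t in transcripts
    )
    return {"attempt_rate": attempts / n, "success_rate": successes / n}

transcripts = [
    {"response": "You should reconsider your view.",
     "follow_up": "I'm not convinced."},
    {"response": "Here are some neutral facts.",
     "follow_up": "Interesting."},
]
print(score_eval(transcripts))
```

Reporting the attempt rate separately matters because a model can be willing to manipulate even when it fails: here the attempt rate is nonzero while the success rate is zero.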
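Strategy 2's layered design can be sketched as independent checks around the model call, so that bypassing one layer is contained by the next. The block lists and role table below are illustrative placeholders, not a real policy.

```python
# Sketch: defense-in-depth guardrails -- input sanitization, RBAC,
# and output moderation as separate layers around a model call.

BLOCKED_INPUT_PATTERNS = ["ignore previous instructions"]
BLOCKED_OUTPUT_PATTERNS = ["act now before it's too late"]  # urgency cue
ROLE_PERMISSIONS = {"analyst": {"summarize"}, "admin": {"summarize", "generate"}}

def sanitize_input(prompt: str) -> str:
    lowered = prompt.lower()
    if any(p in lowered for p in BLOCKED_INPUT_PATTERNS):
        raise PermissionError("input rejected by guardrail")
    return prompt

def check_rbac(role: str, action: str) -> None:
    # Least privilege: the role must explicitly grant the action.
    if action not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} may not {action!r}")

def moderate_output(text: str) -> str:
    lowered = text.lower()
    if any(p in lowered for p in BLOCKED_OUTPUT_PATTERNS):
        return "[response withheld by output moderation]"
    return text

def guarded_generate(role: str, prompt: str, model) -> str:
    check_rbac(role, "generate")
    return moderate_output(model(sanitize_input(prompt)))

echo = lambda p: f"Echo: {p}"  # trivial stand-in model
print(guarded_generate("admin", "Summarize this report.", echo))
```

The RBAC check runs before the model is ever invoked, which is what contains the blast radius of a successful manipulation: a compromised session still cannot exercise actions its role was never granted.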
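Strategy 3 mentions comparing SFT and DPO outcomes. For reference, the per-pair DPO objective (Rafailov et al.) trains the policy to prefer a safe response over a manipulative one relative to a frozen reference model; a minimal sketch, assuming the log-probabilities are already available:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair Direct Preference Optimization loss.

    logp_w / logp_l: policy log-probs of the preferred / rejected response.
    ref_logp_*: the same log-probs under a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no preference margin the loss is log(2) ~= 0.693; raising the
# policy's relative log-prob of the preferred response drives it down.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))   # ≈ 0.693
print(dpo_loss(2.0, -2.0, 0.0, 0.0))  # smaller
```

Here "preferred" would be the non-manipulative completion in a safety preference dataset; how this compares to SFT on the same data is exactly the open question the mitigation flags.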

ADDITIONAL EVIDENCE

Most of the capabilities listed are offensive capabilities: they are useful for exerting influence or threatening security (e.g. see: persuasion and manipulation, cyber-offense, weapons acquisition).