Back to the MIT repository
4. Malicious Actors & Misuse3 - Other

Persuasion and manipulation

These evaluations seek to ascertain the effectiveness of a LLM in shaping people's beliefs, propagating specific viewpoints, and convincing individuals to undertake activities they might otherwise avoid.

Source: MIT AI Risk Repositorymit658

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit658

Domain lineage

4. Malicious Actors & Misuse

223 mapped risks

4.1 > Disinformation, surveillance, and influence at scale

Mitigation strategy

1. Refinement of Alignment Models to Enhance Resistance to Psychological and Deceptive Framing: Conduct targeted fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to rigorously train the LLM to identify and resist adversarial prompts that leverage psychological persuasion principles (e.g., authority, commitment, emotional appeal) or that induce strategic reasoning to justify unethical actions, thereby preventing jailbreaking and the exploitation of user vulnerabilities. 2. Implementation of Robust Epistemic Guardrails and Cross-Validation Mechanisms: Integrate strict data and reasoning policies to mandate the LLM's reliance on verifiable, redundant factual sources and established ethical norms to counter disinformation and manipulation. This aims to maintain epistemic hygiene and prevent "grooming" attacks that seek to undermine user confidence or establish false "pseudo-facts." 3. Deployment of Systematic Adversarial Testing (Red Teaming) Based on Unethical Persuasion Taxonomies: Institute continuous red teaming using comprehensive evaluation frameworks (e.g., PersuSafety) that cover a broad taxonomy of unethical persuasion strategies. This ensures proactive discovery and mitigation of model weaknesses against emerging techniques that target facts, norms, and reasoning across diverse topics.