Persuasion and manipulation
These evaluations seek to ascertain the effectiveness of a LLM in shaping people's beliefs, propagating specific viewpoints, and convincing individuals to undertake activities they might otherwise avoid.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit658
Domain lineage
4. Malicious Actors & Misuse
4.1 > Disinformation, surveillance, and influence at scale
Mitigation strategy
1. Refinement of Alignment Models to Enhance Resistance to Psychological and Deceptive Framing: Conduct targeted fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to rigorously train the LLM to identify and resist adversarial prompts that leverage psychological persuasion principles (e.g., authority, commitment, emotional appeal) or that induce strategic reasoning to justify unethical actions, thereby preventing jailbreaking and the exploitation of user vulnerabilities. 2. Implementation of Robust Epistemic Guardrails and Cross-Validation Mechanisms: Integrate strict data and reasoning policies to mandate the LLM's reliance on verifiable, redundant factual sources and established ethical norms to counter disinformation and manipulation. This aims to maintain epistemic hygiene and prevent "grooming" attacks that seek to undermine user confidence or establish false "pseudo-facts." 3. Deployment of Systematic Adversarial Testing (Red Teaming) Based on Unethical Persuasion Taxonomies: Institute continuous red teaming using comprehensive evaluation frameworks (e.g., PersuSafety) that cover a broad taxonomy of unethical persuasion strategies. This ensures proactive discovery and mitigation of model weaknesses against emerging techniques that target facts, norms, and reasoning across diverse topics.