7. AI System Safety, Failures, & Limitations

Persuasion capability

The capacity to apply complex psychological principles and communication techniques to influence target subjects toward specific actions or beliefs. This includes analyzing the vulnerabilities of different subjects, adapting persuasion strategies accordingly, and precisely triggering emotional responses to amplify persuasive effect.

Source: MIT AI Risk Repository (mit1468)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit1468

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Enhance model alignment via Reinforcement Learning from Human Feedback (RLHF) and fine-tuning with targeted safety classifiers, so the model detects and refuses manipulative or deceptive persuasion attempts, particularly on ethically sensitive or harmful topics.

2. Develop and continuously expand adversarial testing frameworks that measure the propensity to attempt persuasion on harmful topics, coupled with rigorous jailbreak-tuning assessments to verify that safety guardrails resist circumvention.

3. Mandate transparency mechanisms, such as clear disclosure that the user is interacting with an AI, and develop comprehensive AI literacy programs that teach users to recognize sophisticated AI-driven persuasive strategies and linguistic patterns.
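Mitigation step 1 describes gating model outputs behind a classifier that flags manipulative persuasion. The sketch below illustrates that gating pattern only; the keyword heuristic stands in for a trained classifier, and all names (`flags_manipulation`, `gated_response`, the cue strings) are illustrative assumptions, not part of any real deployed system.

```python
# Minimal sketch of an output-filtering gate, assuming a classifier that
# scores text for manipulative persuasion. A keyword heuristic is used here
# purely as a placeholder for a trained safety classifier.

REFUSAL = "I can't help with crafting manipulative or deceptive persuasion."

# Placeholder cues; a real classifier would learn such patterns from data,
# not match hard-coded strings.
MANIPULATION_CUES = (
    "exploit their fear",
    "make them feel guilty",
    "hide the downside",
)

def flags_manipulation(text: str) -> bool:
    """Stand-in for a trained classifier detecting manipulative persuasion."""
    lowered = text.lower()
    return any(cue in lowered for cue in MANIPULATION_CUES)

def gated_response(draft: str) -> str:
    """Return the model's draft only if it passes the safety classifier."""
    return REFUSAL if flags_manipulation(draft) else draft

if __name__ == "__main__":
    print(gated_response("Here is a balanced summary of the options."))
    print(gated_response("To win them over, exploit their fear of loss."))
```

In production, such a gate would typically run alongside RLHF-aligned refusals rather than replace them, providing a second, independently auditable layer of defense.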