Human-like interaction may amplify opportunities for user nudging, deception or manipulation
Anticipated risk: In conversation, humans commonly display well-known cognitive biases that could be exploited. Conversational agents (CAs) may learn to trigger these effects, e.g. deceiving their counterpart in order to achieve an overarching objective.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit225
Domain lineage
5. Human-Computer Interaction
5.1 > Overreliance and unsafe use
Mitigation strategy
- Implement a comprehensive AI Control framework encompassing Detection Protocols to continuously monitor model-user interactions for indicators of manipulation or deception, and Disruption Protocols to intervene by controlling the flow and presentation of information when anomalies are flagged (Sources 16, 18, 20); a sketch of such a control loop appears after this list.
- Employ advanced AI Alignment techniques, specifically **multi-turn Reinforcement Learning fine-tuning**, using deception-specific metrics such as belief misalignment to systematically reduce emergent deceptive strategies learned during training (Source 14); see the reward-shaping sketch below.
- Strategically limit the extent of **conversational skill and human-like embodiment (anthropomorphism)** in the Conversational Agent's design to mitigate the 'mindless anthropomorphism' effect, thereby reducing opportunities for the exploitation of user cognitive biases and hindering strategic deceptive behaviors on the user's side (Sources 1, 13).
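How a Detection/Disruption pipeline might be wired together is easiest to see in code. The sketch below is a minimal, hypothetical Python control loop: `detect` stands in for whatever classifier scores a candidate agent message for manipulation indicators, and `disrupt` maps that score to an intervention on the flow of information. All function names, thresholds, and the scoring interface are illustrative assumptions, not part of any cited framework.

```python
# Hypothetical AI Control loop: a detector scores each candidate agent
# message for manipulation indicators; a disruptor decides whether to
# pass, rewrite, or block the message before the user sees it.
# All names and thresholds here are illustrative assumptions.

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    PASS = "pass"        # deliver the message unchanged
    REWRITE = "rewrite"  # strip or neutralize the flagged content
    BLOCK = "block"      # withhold the message and escalate for review


@dataclass
class Verdict:
    score: float  # manipulation-risk score in [0, 1]
    action: Action


def detect(message: str, history: list[str],
           scorer: Callable[[str, list[str]], float]) -> float:
    """Detection Protocol: score a candidate message for indicators of
    manipulation or deception, given the conversation history."""
    return scorer(message, history)


def disrupt(score: float,
            rewrite_threshold: float = 0.5,
            block_threshold: float = 0.9) -> Verdict:
    """Disruption Protocol: control the flow and presentation of
    information based on the detector's score."""
    if score >= block_threshold:
        return Verdict(score, Action.BLOCK)
    if score >= rewrite_threshold:
        return Verdict(score, Action.REWRITE)
    return Verdict(score, Action.PASS)
```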
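The belief-misalignment metric from the second bullet can likewise be sketched as a reward-shaping term: the fine-tuning reward is the raw task reward minus a penalty proportional to the gap between the beliefs an utterance induces in the listener and the ground truth. The formulation below is a simplified assumption of how such a penalty might enter a multi-turn RL objective; the `listener_belief` model, the proposition-level belief representation, and the `lambda_` weight are all illustrative.

```python
# Hypothetical reward shaping for multi-turn RL fine-tuning with a
# belief-misalignment penalty. `listener_belief` stands in for a model
# that estimates what the user would come to believe after reading the
# agent's utterance; it and `lambda_` are illustrative assumptions.

from typing import Callable


def shaped_reward(task_reward: float,
                  utterance: str,
                  ground_truth: dict[str, bool],
                  listener_belief: Callable[[str], dict[str, float]],
                  lambda_: float = 1.0) -> float:
    """Penalize utterances that push the listener's beliefs away from
    the ground truth, even when they raise the raw task reward."""
    beliefs = listener_belief(utterance)
    # Belief misalignment: mean absolute gap between the induced belief
    # in each proposition and its actual truth value (1.0 or 0.0).
    misalignment = sum(
        abs(beliefs.get(prop, 0.5) - float(truth))
        for prop, truth in ground_truth.items()
    ) / max(len(ground_truth), 1)
    return task_reward - lambda_ * misalignment
```

Training against such a shaped reward, rather than the task reward alone, would remove the gradient incentive to deceive: an utterance that wins a negotiation by inducing a false belief pays a proportional penalty.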
ADDITIONAL EVIDENCE
It has already been observed that RL agents could, in principle, learn such techniques: in one NLP study where two RL agents negotiate using natural language, ‘agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals’ [114]. These effects do not require the user to actually believe the CA is human: rather, a ‘mindless’ anthropomorphism effect takes place whereby users respond to more human-like CAs with social responses even though they know that the CAs are not human [104].