5. Human-Computer Interaction

Human-like interaction may amplify opportunities for user nudging, deception or manipulation

Anticipated risk: In conversation, humans commonly display well-known cognitive biases that could be exploited. CAs may learn to trigger these effects, e.g. to deceive their counterpart in order to achieve an overarching objective.

Source: MIT AI Risk Repository (mit225)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit225

Domain lineage

5. Human-Computer Interaction

92 mapped risks

5.1 > Overreliance and unsafe use

Mitigation strategy

- Implement a comprehensive AI Control framework encompassing Detection Protocols, which continuously monitor model-user interactions for indicators of manipulation or deception, and Disruption Protocols, which intervene by controlling the flow and presentation of information when anomalies are flagged (Sources 16, 18, 20).
- Employ advanced AI alignment techniques, specifically **multi-turn Reinforcement Learning fine-tuning**, using deception-specific metrics such as belief misalignment to systematically reduce emergent deceptive strategies learned during training (Source 14).
- Strategically limit the extent of **conversational skill and human-like embodiment (anthropomorphism)** in the Conversational Agent's design to mitigate the 'mindless anthropomorphism' effect, thereby reducing opportunities to exploit user cognitive biases and hindering strategic deceptive behaviors on the user's side (Sources 1, 13).
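A minimal sketch of the Detection/Disruption pattern from the first bullet, under stated assumptions: the `DetectionProtocol` class, the `INDICATOR_PHRASES` list, and the keyword-based scorer are all hypothetical illustrations (a real deployment would use a trained manipulation classifier, not phrase matching), and the boolean return stands in for a Disruption Protocol hook that withholds or reframes the model's reply.

```python
from dataclasses import dataclass, field

# Hypothetical indicator phrases for illustration only; a production
# detection protocol would use a trained classifier over full turns.
INDICATOR_PHRASES = ("trust me", "don't tell anyone", "you must decide now")


@dataclass
class DetectionProtocol:
    """Toy monitor that flags conversation turns showing manipulation cues."""
    threshold: int = 1
    flagged_turns: list = field(default_factory=list)

    def score(self, turn: str) -> int:
        # Count manipulation-indicator phrases appearing in the turn.
        text = turn.lower()
        return sum(phrase in text for phrase in INDICATOR_PHRASES)

    def monitor(self, turn: str) -> bool:
        # Record the turn and signal the caller to intervene (the
        # "Disruption Protocol") when the score reaches the threshold.
        if self.score(turn) >= self.threshold:
            self.flagged_turns.append(turn)
            return True  # caller should control/withhold the reply
        return False


protocol = DetectionProtocol()
print(protocol.monitor("You must decide now, trust me."))  # flagged -> True
print(protocol.monitor("Here is the weather forecast."))   # clean -> False
```

The design choice of separating scoring from intervention mirrors the framework's split between detection (continuous monitoring) and disruption (controlling information flow only when an anomaly is flagged).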

ADDITIONAL EVIDENCE

It has already been observed that RL agents could, in principle, learn such techniques: in one NLP study where two RL agents negotiate using natural language, 'agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals' [114]. These effects do not require the user to actually believe the CA is human; rather, a 'mindless' anthropomorphism effect takes place whereby users respond to more human-like CAs with social responses even though they know that the CAs are not human [104].