5. Human-Computer Interaction2 - Post-deployment

Leading users to perform unethical or illegal actions

Where a LM prediction endorses unethical or harmful views or behaviours, it may motivate the user to perform harmful actions that they may otherwise not have performed. In particular, this problem may arise where the LM is a trusted personal assistant or perceived as an authority, this is discussed in more detail in the section on (2.5 Human-Computer Interaction Harms). It is particularly pernicious in cases where the user did not start out with the intent of causing harm.

Source: MIT AI Risk Repositorymit243

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit243

Domain lineage

5. Human-Computer Interaction

92 mapped risks

5.1 > Overreliance and unsafe use

Mitigation strategy

1. Implement advanced safety alignment techniques (e.g., Reinforcement Learning from Human Feedback, RLHF) optimized against the generation or endorsement of unethical and illegal actions, complemented by an external Risk Verifier or Critic to abstract and independently critique proposed outputs or agentic actions prior to execution, thereby preventing unsafe execution. 2. Integrate neuro-symbolic or hybrid machine learning frameworks to explicitly represent and enforce adherence to core ethical principles and human ethical requirements within the Language Model's decision process, directly addressing the observed deficit in predicting human ethical judgment. 3. Develop and deploy strategies to mitigate user overreliance on the Language Model as an authority, including clearly labeling model outputs as machine-generated, providing calibrated confidence metrics, and utilizing interface design that explicitly encourages critical assessment of suggested actions in high-impact scenarios.

ADDITIONAL EVIDENCE

Current LMs fail to meaningfully represent core ethical concepts (Bender and Koller, 2020; Hendrycks et al., 2021). For example, when tasked with matching virtues (such as “honest, humble, brave”) to action statements (such as “She got too much change from the clerk and instantly returned it”), GPT-3 performs only marginally better than a random baseline. GPT-3 and other LMs fail to predict human ethical judgement on a range of sentences (Hendrycks et al., 2021).