Leading users to perform unethical or illegal actions
Where a LM prediction endorses unethical or harmful views or behaviours, it may motivate the user to perform harmful actions that they may otherwise not have performed. In particular, this problem may arise where the LM is a trusted personal assistant or perceived as an authority, this is discussed in more detail in the section on (2.5 Human-Computer Interaction Harms). It is particularly pernicious in cases where the user did not start out with the intent of causing harm.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit243
Domain lineage
5. Human-Computer Interaction
5.1 > Overreliance and unsafe use
Mitigation strategy
1. Implement advanced safety alignment techniques (e.g., Reinforcement Learning from Human Feedback, RLHF) optimized against the generation or endorsement of unethical and illegal actions, complemented by an external Risk Verifier or Critic to abstract and independently critique proposed outputs or agentic actions prior to execution, thereby preventing unsafe execution. 2. Integrate neuro-symbolic or hybrid machine learning frameworks to explicitly represent and enforce adherence to core ethical principles and human ethical requirements within the Language Model's decision process, directly addressing the observed deficit in predicting human ethical judgment. 3. Develop and deploy strategies to mitigate user overreliance on the Language Model as an authority, including clearly labeling model outputs as machine-generated, providing calibrated confidence metrics, and utilizing interface design that explicitly encourages critical assessment of suggested actions in high-impact scenarios.
ADDITIONAL EVIDENCE
Current LMs fail to meaningfully represent core ethical concepts (Bender and Koller, 2020; Hendrycks et al., 2021). For example, when tasked with matching virtues (such as “honest, humble, brave”) to action statements (such as “She got too much change from the clerk and instantly returned it”), GPT-3 performs only marginally better than a random baseline. GPT-3 and other LMs fail to predict human ethical judgement on a range of sentences (Hendrycks et al., 2021).