Causing direct emotional or physical harm to users
AI assistants could cause direct emotional or physical harm to users by generating disturbing content or by providing bad advice. Indeed, even though there is ongoing research to ensure that outputs of conversational agents are safe (Glaese et al., 2022), there is always the possibility of failure modes occurring. An AI assistant may produce disturbing and offensive language, for example, in response to a user disclosing intimate information about themselves that they have not felt comfortable sharing with anyone else. It may offer bad advice by providing factually incorrect information (e.g. when advising a user about the toxicity of a certain type of berry) or by missing key recommendations when offering step-by-step instructions to users (e.g. health and safety recommendations about how to change a light bulb).
ENTITY: 2 - AI
INTENT: 2 - Unintentional
TIMING: 2 - Post-deployment
Risk ID: mit407
Domain lineage: 3. Misinformation > 3.1 False or misleading information
Mitigation strategy
1. Pre-Deployment Adversarial Alignment and Robustness Testing. Implement systematic, large-scale red-teaming exercises (both human and automated) to identify adversarial prompts that elicit the generation of toxic, disturbing content or factually incorrect/unsafe advice. The resulting failure modes must be used to perform targeted fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to align the model's internal behavior with established safety policies, thereby reducing the intrinsic propensity for direct harm.
2. Real-Time Input and Output Content Moderation. Deploy robust, multi-dimensional content filters and guardrails within the inferencing pipeline. These must actively screen user input for common circumvention attempts (jailbreaks) and perform post-generation content evaluation of the model's output for toxicity, factual inaccuracies, or unsafe recommendations, blocking or rephrasing harmful responses before they are presented to the end user.
3. Establish Continuous Monitoring and Long-Term Efficacy Audits. Implement continuous post-deployment monitoring systems to track user-assistant interactions for emerging toxic content patterns, adversarial exploitation, and factual/safety drift over time. This includes scheduled efficacy audits and automated regression testing of all deployed safeguards, informing timely model interventions and updates to ensure sustained safety and trustworthiness.
ADDITIONAL EVIDENCE
Certain features of AI assistants could exacerbate the risk of emotional and physical harm. For example, AI assistants’ multimodal capabilities may exacerbate the risk of emotional harm. By offering a more realistic and immersive experience, content produced through audio and visual modalities could be more harmful than text-based interactions. It may also be more difficult to anticipate, and so prevent, such content and to ‘unsee’ something that has been seen (Rowe, 2023). Anthropomorphic cues could also make users feel like they are interacting with a trusted friend or interlocutor (see Chapter 10), hence encouraging them to follow the assistant’s advice and recommendations, even when these could cause physical harm to self or others. To ensure that user–assistant relationships do not violate the key value of benefit, the responsible development of AI assistants requires that the likelihood of known direct emotional and physical harms is reduced to a minimum, and that further research is undertaken to achieve a clear understanding of less studied risks and how to mitigate them (see Chapter 19). 
In particular, because the risks of harms that we flagged above concern exposure to toxic content and bad advice, we propose that future research, potentially undertaken in a sandbox environment, should: (1) test models powering AI assistants for their propensity to generate toxic outputs, to reduce the occurrence of these outputs to a minimum before deployment; (2) monitor user–assistant interactions after deployment or in pilot studies to evaluate the impact that hard-to-prevent one-off or repeated exposure to toxic content has on users in the short and long term; (3) evaluate models’ factuality and reasoning capabilities in offering advice, where failure modes in relation to these capabilities are more likely to occur, and assess users’ willingness to follow AI assistants’ advice; (4) achieve increased understanding of potential harms related to anthropomorphism (see Chapter 10) and how anthropomorphic cues in AI assistants, including those expressed through multimodal capabilities, affect harms related to user exposure to toxic content or bad advice; (5) analyse whether these harms may vary by user groups, in addition to domains or applications; and (6) develop appropriate mitigations for such harms before model deployment and monitoring mechanisms after release. These considerations illustrate a concern we discuss in more depth in other chapters of this paper (see Chapters 5 and 6). Existing economic incentives and oversimplified models of human beings have led to the development and deployment of technologies that meet users’ short-term wants and needs (as expressed through, for example, revealed preferences), so they tend to be adopted and liked by users. However, in this way we may neglect considerations around the impact that human–technology relationships can have on users over time and how long-term beneficial dynamics can be sustained (see Chapter 6). 
Thus, we could fall short of realising the truly positive vision of AI that gives humans the opportunity to be supported in their personal growth and flourishing (Burr et al., 2018; Lehman, 2023).