5. Human-Computer Interaction | 2 - Post-deployment

Alignment trust

Users may develop alignment trust in AI assistants, understood as the belief that assistants have good intentions towards them and act in alignment with their interests and values, as a result of emotional or cognitive processes (McAllister, 1995). Evidence from empirical studies on emotional trust in AI (Kaplan et al., 2023) suggests that AI assistants’ increasingly realistic human-like features and behaviours are likely to inspire users’ perceptions of friendliness, liking and a sense of familiarity towards their assistants, thus encouraging users to develop emotional ties with the technology and perceive it as being aligned with their own interests, preferences and values (see Chapters 5 and 10). The emergence of these perceptions and emotions may be driven by the desire of developers to maximise the appeal of AI assistants to their users (Abercrombie et al., 2023). Although users are most likely to form these ties when they mistakenly believe that assistants have the capacity to love and care for them, the attribution of mental states is not a necessary condition for emotion-based alignment trust to arise. Indeed, evidence shows that humans may develop emotional bonds with, and so trust, AI systems, even when they are aware they are interacting with a machine (Singh-Kurtz, 2023; see also Chapter 11).

Moreover, the assistant’s function may encourage users to develop alignment trust through cognitive processes. For example, a user interacting with an AI assistant for medical advice may develop expectations that their assistant is committed to promoting their health and well-being in a similar way to how professional duties governing doctor–patient relationships inspire trust (Mittelstadt, 2019). Users’ alignment trust in AI assistants may be ‘betrayed’, and so expose users to harm, in cases where assistants are themselves accidentally misaligned with what developers want them to do (see the ‘misaligned scheduler’ (Shah et al., 2022) in Chapter 7).
For example, an AI medical assistant fine-tuned on data scraped from a Reddit forum where non-experts discuss medical issues is likely to give medical advice that sounds compelling but is unsafe and would not be endorsed by medical professionals. Indeed, excessive trust in the alignment between AI assistants and user interests may even lead users to disclose highly sensitive personal information (Skjuve et al., 2022), thus exposing them to malicious actors who could repurpose it for ends that do not align with users’ best interests (see Chapters 8, 9 and 13).

Ensuring that AI assistants do what their developers and users expect them to do is only one side of the problem of alignment trust. The other side centres on situations in which alignment trust in AI developers is itself miscalibrated. While developers typically aim to align their technologies with the preferences, interests and values of their users – and are incentivised to do so to encourage adoption of and loyalty to their products – the satisfaction of these preferences and interests may also compete with other organisational goals and incentives (see Chapter 5). These organisational goals may or may not be compatible with those of the users. Because information asymmetries exist between users and developers of AI assistants, particularly with regard to how the technology works, what it optimises for and what safety checks and evaluations have been undertaken to ensure the technology supports users’ goals, it may be difficult for users to ascertain when their alignment trust in developers is justified, thus leaving them vulnerable to the power and interests of other actors. For example, a user may believe that their AI assistant is a trusted friend who books holidays based on their preferences, values or interests, when in fact, by design, the technology is more likely to book flights and hotels from companies that have paid for privileged access to the user.

Source: MIT AI Risk Repository (mit413)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit413

Domain lineage

5. Human-Computer Interaction

92 mapped risks

5.1 > Overreliance and unsafe use

Mitigation strategy

1. Deploy context-aware, trust-adaptive interventions – such as providing supporting evidence during periods of low user trust and counter-explanations or forced deliberation pauses during periods of high user trust – to actively calibrate user reliance toward warranted levels and mitigate both under- and over-reliance.

2. Establish granular Role-Based Access Control (RBAC) at the model-interaction layer and enforce the principle of least privilege for AI agent functions, strictly separating action generation from execution for sensitive operations to contain overreach and prevent unauthorized data disclosure stemming from miscalibrated trust.

3. Mandate comprehensive transparency and limitations disclosure, providing clear documentation (e.g., AI transparency cards) detailing the assistant's validation, optimization goals (especially when competing with developer incentives), and functional boundaries to ensure users form appropriate initial expectations and warranted trust.
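As a minimal sketch of the second strategy, an agent framework might enforce least-privilege RBAC and keep action *generation* and action *execution* behind separate permissions, with sensitive operations additionally gated on explicit approval. All role names, permission sets and function signatures below are hypothetical illustrations, not an API from the source:

```python
from dataclasses import dataclass, field

# Hypothetical roles with least-privilege permission sets.
# Note: no single role may both propose AND execute a sensitive action.
ROLE_PERMISSIONS = {
    "reader": {"read_itinerary"},
    "booking_agent": {"read_itinerary", "propose_booking"},
    "executor": {"execute_booking"},
}

# Operations that additionally require explicit (e.g. human) approval.
SENSITIVE_ACTIONS = {"execute_booking"}


@dataclass
class ProposedAction:
    """An action the agent has generated but not yet executed."""
    name: str
    params: dict = field(default_factory=dict)
    approved: bool = False  # flipped only by an explicit approval step


def authorize(role: str, action_name: str) -> bool:
    """RBAC check: a role may perform an action only if explicitly granted."""
    return action_name in ROLE_PERMISSIONS.get(role, set())


def execute(action: ProposedAction, role: str) -> str:
    """Execute a previously generated action, enforcing RBAC and approval.

    Generation (propose_booking) and execution (execute_booking) are
    distinct permissions held by distinct roles, so a miscalibrated or
    manipulated generation step cannot act on its own output.
    """
    if not authorize(role, action.name):
        raise PermissionError(f"role {role!r} lacks permission {action.name!r}")
    if action.name in SENSITIVE_ACTIONS and not action.approved:
        raise PermissionError("sensitive action requires explicit approval")
    return f"executed {action.name}"
```

For instance, a `booking_agent` can propose a booking but a call to `execute` under that role raises `PermissionError`; only an `executor` role, and only after `approved` is set by a separate confirmation step, may carry out the sensitive operation.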