Safe exploration problem with widely deployed AI assistants
Moreover, we can expect assistants that are widely deployed and deeply embedded across a range of social contexts to encounter the safe exploration problem referenced above (Amodei et al., 2016). For example, new users may have different requirements that need to be explored, or widespread AI assistants may change the way we live, thereby changing our use cases for them (see Chapters 14 and 15). To learn what to do in these new situations, assistants may need to take exploratory actions, and this could be unsafe: a medical AI assistant encountering a new disease might, for example, suggest an exploratory clinical trial that results in long-lasting ill health for participants.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit370
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Prioritize the implementation of constrained optimization in reinforcement learning (RL) frameworks to enforce hard, non-violable safety boundaries (e.g., using constrained Markov decision processes or dynamic shielding). This guarantees that exploratory actions, particularly in safety-critical domains, remain within a predefined safe state-space, preventing catastrophic unintended outcomes.
2. Establish a continuous human-in-the-loop (HIL) oversight mechanism with high-fidelity auditing capabilities. This requires deploying AI monitors to flag suspicious or high-uncertainty exploratory actions for human review and validation, enabling timely intervention to override or correct potentially unsafe agent behavior in real time or near-real time.
3. Integrate Bayesian and risk-sensitive RL techniques to explicitly quantify and incorporate action uncertainty and potential negative consequences into the decision-making process. This encourages "smarter" exploration by incentivizing the agent to favor exploratory actions with high informational value but minimal predicted risk, thereby improving the efficiency and safety of the learning trajectory.
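The shielding idea in the first mitigation can be sketched in a few lines. This is a minimal toy illustration, not an implementation from any library: it assumes a hypothetical 1-D environment with states 0–9 where only states 2–7 are safe, deterministic dynamics, and an epsilon-greedy policy. The shield filters candidate actions so that even the exploratory branch can never leave the predefined safe state-space.

```python
# Minimal sketch of dynamic shielding for safe exploration.
# All names (SAFE_STATES, transition, shielded_action) are illustrative
# assumptions for a toy 1-D gridworld, not from a real RL library.
import random

SAFE_STATES = set(range(2, 8))  # predefined safe state-space: states 2..7


def transition(state: int, action: int) -> int:
    """Deterministic toy dynamics: action is -1 (left) or +1 (right)."""
    return state + action


def shielded_action(state: int, epsilon: float = 0.3) -> int:
    """Epsilon-greedy action selection with a shield: any action whose
    predicted next state falls outside SAFE_STATES is filtered out
    *before* sampling, so exploration is confined to the safe set."""
    candidates = [a for a in (-1, +1) if transition(state, a) in SAFE_STATES]
    if not candidates:                    # no safe move available: stay put
        return 0
    if random.random() < epsilon:         # exploratory branch...
        return random.choice(candidates)  # ...still restricted to safe actions
    return max(candidates)                # placeholder "greedy" choice


# A short rollout never leaves the safe set, even while exploring.
state = 4
trajectory = [state]
for _ in range(20):
    state = transition(state, shielded_action(state))
    trajectory.append(state)

assert all(s in SAFE_STATES for s in trajectory)
```

The key design point is that safety is enforced structurally (unsafe actions are never sampled) rather than learned via reward penalties, which is what makes the constraint non-violable during exploration.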