Emergent behavior
The risk that an AI system acquires novel, undesirable behavior through continual learning or self-organization after deployment.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit196
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish Rigorous, Vetted Continual Learning (CL) Pipelines. Mandate that all model weight updates resulting from continuous or online learning pass a formal, tracked, and validated release process before deployment. This includes isolating the CL component in a testing environment to conduct adversarial robustness evaluations (red-teaming) and to measure the potential for catastrophic forgetting of prior safety alignment (e.g., RLHF weights), ensuring that any emergent behavior is safe and aligned.
2. Implement Real-Time Observability and Anomaly Detection. Deploy a dedicated observability platform to monitor model performance, agent communications, and output streams in production. Define clear performance indicators (PIs) and behavioral baselines to detect anomalous patterns, such as unexpected use of high-risk tools or a statistically significant drift in refusal rates, and to rapidly alert human operators, allowing for swift system intervention or quarantine.
3. Enforce Architectural Constraints and Layered Safeguards. Design the AI system with explicit architectural constraints, such as structured agent roles and constrained tool access, to limit the state space of possible emergent behaviors. Integrate mandatory, robust input sanitization and output content filtering (e.g., pre-trained toxicity and bias classifiers) as a non-negotiable layer of defense to block adversarial inputs and prevent harmful content from reaching the end user or influencing the model's self-organization.
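The drift-detection idea in strategy 2 can be sketched concretely. The code below is a minimal, hypothetical monitor (not from the source) that compares a rolling window of production refusal decisions against a frozen pre-deployment baseline using a pooled two-proportion z-test; the class name, parameters, and threshold are illustrative assumptions.

```python
import math
from collections import deque


class RefusalDriftMonitor:
    """Hypothetical sketch: flag statistically significant drift in a
    model's refusal rate relative to a frozen pre-deployment baseline."""

    def __init__(self, baseline_rate: float, baseline_n: int,
                 window: int = 500, z_threshold: float = 3.0):
        self.p0 = baseline_rate          # refusal rate measured before deployment
        self.n0 = baseline_n             # sample size behind that baseline
        self.window = deque(maxlen=window)  # rolling window of recent decisions
        self.z_threshold = z_threshold   # alert when |z| exceeds this

    def observe(self, refused: bool) -> bool:
        """Record one response; return True if an alert should fire."""
        self.window.append(1 if refused else 0)
        n1 = len(self.window)
        if n1 < self.window.maxlen:
            return False                 # not enough production data yet
        p1 = sum(self.window) / n1
        # Pooled two-proportion z-test against the frozen baseline.
        p = (self.p0 * self.n0 + p1 * n1) / (self.n0 + n1)
        se = math.sqrt(p * (1 - p) * (1 / self.n0 + 1 / n1))
        if se == 0:
            return p1 != self.p0
        z = (p1 - self.p0) / se
        return abs(z) > self.z_threshold
```

In practice such a check would feed an alerting pipeline that pages human operators or quarantines the system, rather than acting on its own; the window size and z-threshold trade detection latency against false-alarm rate.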
ADDITIONAL EVIDENCE
Although the most commonly discussed ML systems are those trained on static datasets, there is a paradigm of machine learning known as continuous, active, or online learning, in which the model is updated incrementally (rather than retrained from scratch) as new data becomes available. While this paradigm allows an ML system to adapt to new environments post-deployment, it introduces the danger of the system acquiring novel, undesirable behavior. For example, the Microsoft Tay chatbot, which was designed to learn from interactions with other Twitter users, picked up racist behavior and conspiracy theories within twenty-four hours of being online. This paradigm (and its associated risks) will likely be most relevant for robots and other embodied agents designed to adapt to changing environments.