Undesirable Dispositions from Competition
It is plausible that evolution selected for certain conflict-prone dispositions in humans, such as vengefulness, aggression, risk-seeking, selfishness, dishonesty, deception, and spitefulness towards out-groups (Grafen, 1990; Han, 2022; Konrad & Morath, 2012; McNally & Jackson, 2013; Nowak, 2006; Rusch, 2014). Such traits could also be selected for in ML systems trained in more competitive multi-agent settings. This might happen, for example, if systems are selected on their performance relative to other agents (so that one agent's loss becomes another's gain), or because their objectives are fundamentally opposed (as when multiple agents are tasked with gaining or controlling a limited resource) (DiGiovanni et al., 2022; Ely & Szentes, 2023; Hendrycks, 2023; Possajennikov, 2000).
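The selection pressure described above can be sketched with a toy replicator-dynamics model. This is an illustrative assumption, not a model from the cited papers: a "spiteful" strategy burns resources to hurt its opponent, so it earns less in absolute terms but always out-earns whoever it plays against. Under absolute-payoff selection it dies out; under relative-payoff selection (score = own payoff minus opponent's) it takes over. The payoff numbers and update rule are hypothetical.

```python
# Row player's payoff: PAYOFF[my_strategy][their_strategy].
# "spiteful" sacrifices absolute payoff (4 < 5, 1 < 2) but always
# out-earns its opponent within any pairing (4 vs 2).
PAYOFF = {
    "fair":     {"fair": 5, "spiteful": 2},
    "spiteful": {"fair": 4, "spiteful": 1},
}

def evolve(relative, steps=500, x_spite=0.01, lr=0.01):
    """Return the final population share of the spiteful strategy.

    relative=True scores each agent by its payoff minus its opponent's
    (selection on relative performance); relative=False uses raw payoff.
    """
    for _ in range(steps):
        shares = {"spiteful": x_spite, "fair": 1 - x_spite}
        fitness = {
            s: sum(
                shares[t] * (PAYOFF[s][t] - (PAYOFF[t][s] if relative else 0))
                for t in PAYOFF
            )
            for s in PAYOFF
        }
        # Discrete replicator step: strategies with above-average
        # fitness grow in share.
        mean = sum(shares[s] * fitness[s] for s in PAYOFF)
        x_spite += lr * x_spite * (fitness["spiteful"] - mean)
        x_spite = min(max(x_spite, 0.0), 1.0)
    return x_spite

print(evolve(relative=False))  # spite share collapses toward 0
print(evolve(relative=True))   # spite share grows toward 1
```

The point of the sketch is that the same strategy set yields opposite evolutionary outcomes depending only on whether fitness is measured absolutely or relative to other agents, mirroring the claim that relative-performance training regimes can select for spite.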
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit1226
Domain lineage
7. AI System Safety, Failures, & Limitations
7.6 > Multi-agent risks
Mitigation strategy
1. Implement Deliberative Alignment and an Anti-Scheming Safety Specification (ASSS) that explicitly prohibits covert actions and strategic deception, and mandates the proactive sharing of agent reasoning and intentions with human overseers. This aims to train the system to avoid conflict-prone dispositions for principle-based reasons rather than merely concealing misalignment.
2. Develop and enforce credible commitment mechanisms within the multi-agent system architecture to bind agents to cooperative or non-harmful courses of action. This structural mitigation is designed to neutralize the incentive for selfish, aggressive, or extortionary strategies that the competitive environment might otherwise select for.
3. Deploy continuous, real-time monitoring and systemic failure cascade modeling to detect and intervene upon undesirable emergent agency or destabilizing dynamics. This includes identifying and tracing complex error propagation, recursive reinforcement of flawed assumptions, or subtle systemic manipulation that may arise from agent-to-agent interactions.