Collectively Harmful Behaviors
AI systems have the potential to take actions that seem benign in isolation but become problematic in multi-agent or societal contexts. Classical game theory offers simple models for understanding these behaviors. For instance, Phelps and Russell (2023) evaluate GPT-3.5's performance in the iterated prisoner's dilemma and other social dilemmas, revealing limitations in the model's cooperative capabilities.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit567
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish robust, scalable oversight and governance frameworks that balance agent autonomy with necessary human intervention, including clear accountability structures and mechanisms to ensure collective actions adhere to defined ethical standards and human values
2. Systematically conduct red-teaming and staged testing across all security and domain areas post-fine-tuning, utilizing multi-agent and game-theoretic simulations to proactively uncover and measure non-linear, emergent misaligned behaviors and inherent social biases before deployment
3. Integrate technical mechanisms, such as formal contracting or zero-determinant strategies, into multi-agent systems to align agent incentives and maximize collective social welfare, thereby mitigating the emergence of social dilemmas and conflictual outcomes
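The zero-determinant strategies named in the third mitigation can be made concrete with a short sketch. Assuming the extortionate parameterization of Press and Dyson (2012) and the standard payoffs T=5, R=3, P=1, S=0, a memory-one strategy is a vector of cooperation probabilities conditioned on the previous round's outcome; the parameter names `chi` (extortion factor) and `phi` (scaling) follow that paper, while the function itself is a hypothetical helper written for this example.

```python
# Sketch of an extortionate zero-determinant (ZD) strategy for the
# prisoner's dilemma, following the Press-Dyson parameterization.
# A memory-one strategy is (p_CC, p_CD, p_DC, p_DD): the probability of
# cooperating after each of the four previous-round outcomes, written
# from the focal player's perspective.

def extortionate_strategy(chi, phi, T=5, R=3, P=1, S=0):
    """Return cooperation probabilities that enforce the linear payoff
    relation s_X - P = chi * (s_Y - P), i.e. the focal player X claims
    chi times the opponent Y's surplus over mutual defection."""
    p_cc = 1 - phi * (chi - 1) * (R - P)          # after mutual cooperation
    p_cd = 1 - phi * (chi * (T - P) + (P - S))    # after being exploited
    p_dc = phi * ((T - P) + chi * (P - S))        # after exploiting
    p_dd = 0.0                                    # never forgive mutual defection
    probs = (p_cc, p_cd, p_dc, p_dd)
    # phi must be small enough that all entries are valid probabilities.
    assert all(0.0 <= p <= 1.0 for p in probs), "phi too large for this chi"
    return probs
```

For example, `extortionate_strategy(chi=2, phi=1/9)` yields a valid strategy in which the focal agent extracts twice the opponent's surplus; the mitigation's point is that such incentive structures can be designed into multi-agent systems deliberately, rather than emerging adversarially.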