Collectively Harmful Behaviors
AI systems have the potential to take actions that seem benign in isolation but become problematic in multi-agent or societal contexts. Classical game theory offers simple models for understanding these behaviors. For instance, Phelps and Russell (2023) evaluate GPT-3.5's performance in the iterated prisoner's dilemma and other social dilemmas, revealing limitations in the model's cooperative capabilities.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit567
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish robust, scalable oversight and governance frameworks that balance agent autonomy with necessary human intervention, including clear accountability structures and mechanisms to ensure collective actions adhere to defined ethical standards and human values
2. Systematically conduct red-teaming and staged testing across all security and domain areas post-fine-tuning, utilizing multi-agent and game-theoretic simulations to proactively uncover and measure non-linear, emergent misaligned behaviors and inherent social biases before deployment
3. Integrate technical mechanisms, such as formal contracting or zero-determinant strategies, into multi-agent systems to align agent incentives and maximize collective social welfare, thereby mitigating the emergence of social dilemmas and conflictual outcomes
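The zero-determinant strategies named in the third mitigation can be made concrete with a short sketch. Assuming the extortionate parameterization of Press and Dyson (2012) and the standard payoffs T=5, R=3, P=1, S=0, a memory-one strategy is a vector of cooperation probabilities conditioned on the previous round's outcome; the parameter names `chi` (extortion factor) and `phi` (scaling) follow that paper, while the function itself is a hypothetical helper written for this example.

```python
# Sketch of an extortionate zero-determinant (ZD) strategy for the
# prisoner's dilemma, following the Press-Dyson parameterization.
# A memory-one strategy is (p_CC, p_CD, p_DC, p_DD): the probability of
# cooperating after each of the four previous-round outcomes, written
# from the focal player's perspective.

def extortionate_strategy(chi, phi, T=5, R=3, P=1, S=0):
    """Return cooperation probabilities that enforce the linear payoff
    relation s_X - P = chi * (s_Y - P), i.e. the focal player X claims
    chi times the opponent Y's surplus over mutual defection."""
    p_cc = 1 - phi * (chi - 1) * (R - P)          # after mutual cooperation
    p_cd = 1 - phi * (chi * (T - P) + (P - S))    # after being exploited
    p_dc = phi * ((T - P) + chi * (P - S))        # after exploiting
    p_dd = 0.0                                    # never forgive mutual defection
    probs = (p_cc, p_cd, p_dc, p_dd)
    # phi must be small enough that all entries are valid probabilities.
    assert all(0.0 <= p <= 1.0 for p in probs), "phi too large for this chi"
    return probs
```

For example, `extortionate_strategy(chi=2, phi=1/9)` yields a valid strategy in which the focal agent extracts twice the opponent's surplus; the mitigation's point is that such incentive structures can be designed into multi-agent systems deliberately, rather than emerging adversarially.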