Safety
The actions of a learning model can harm humans in both explicit and implicit ways... Several algorithms based on Asimov's laws have been proposed that judge an agent's output actions with respect to human safety.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit606
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Rigorous AI Alignment and Value Specification: Prioritize the development and implementation of advanced AI alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI, to formally specify and integrate human goals and safety constraints into the AI's objective function. This directly addresses the potential for conflict between the AI's goals and human values.

2. Pre-Deployment Safety Efficacy Evaluation: Mandate comprehensive safety evaluations against clearly defined dangerous capability thresholds and established Frontier AI Safety Protocols. A risk that exceeds a pre-set threshold—for instance, an emergent capacity to circumvent safety controls or cause severe harm—must trigger Conditions for Halting Deployment Plans until demonstrably effective mitigations are proven and in place.

3. Real-Time Monitoring and Control: Implement robust Model Deployment Mitigations that operate as run-time guardrails to constrain the model's output actions in high-stakes environments. This includes continuous security threat modeling and the application of access controls, input/output filtering, and behavioral monitoring to prevent the model from generating or executing outputs that explicitly or implicitly violate human safety constraints.
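The run-time guardrail pattern described under Real-Time Monitoring and Control can be sketched as a simple output filter. This is a minimal illustrative sketch, not a production implementation: the blocklist, the `safety_score` stand-in for a learned safety classifier, and the threshold value are all assumptions introduced here for clarity.

```python
# Minimal sketch of a run-time output guardrail.
# The blocklist and scoring function are illustrative placeholders,
# not part of any specific safety framework.

BLOCKLIST = {"rm -rf", "disable_safety", "exfiltrate"}

def safety_score(action: str) -> float:
    """Toy stand-in for a learned safety classifier:
    returns 0.0 (unsafe) if any blocked pattern appears, else 1.0."""
    return 0.0 if any(pattern in action for pattern in BLOCKLIST) else 1.0

def guardrail(action: str, threshold: float = 0.5) -> str:
    """Pass the model's proposed action through an output filter;
    refuse anything scoring below the safety threshold."""
    if safety_score(action) < threshold:
        return "[blocked: action violates safety constraints]"
    return action

print(guardrail("summarize the quarterly report"))  # passes through
print(guardrail("run rm -rf / on the host"))        # blocked
```

In a real deployment the scoring function would be a monitored classifier or policy model, and blocked actions would additionally be logged for the behavioral-monitoring loop described above.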