Safety
The actions of a learning model can harm humans in both explicit and implicit ways... Several algorithms based on Asimov's laws have been proposed that judge an agent's output actions with respect to human safety.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit606
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Rigorous AI Alignment and Value Specification: Prioritize the development and implementation of advanced AI alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI, to formally specify and integrate human goals and safety constraints into the AI's objective function. This directly addresses the potential for conflict between the AI's goals and human values.

2. Pre-Deployment Safety Efficacy Evaluation: Mandate comprehensive safety evaluations against clearly defined dangerous capability thresholds and established Frontier AI Safety Protocols. A risk that exceeds a pre-set threshold—for instance, an emergent capacity to circumvent safety controls or cause severe harm—must trigger Conditions for Halting Deployment Plans until demonstrably effective mitigations are proven and in place.

3. Real-Time Monitoring and Control: Implement robust Model Deployment Mitigations that operate as run-time guardrails to constrain the model's output actions in high-stakes environments. This includes continuous security threat modeling and the application of access controls, input/output filtering, and behavioral monitoring to prevent the model from generating or executing outputs that explicitly or implicitly violate human safety constraints.
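The run-time guardrail pattern described under Real-Time Monitoring and Control can be sketched as a simple output filter. This is a minimal illustrative sketch, not a production implementation: the blocklist, the `safety_score` stand-in for a learned safety classifier, and the threshold value are all assumptions introduced here for clarity.

```python
# Minimal sketch of a run-time output guardrail.
# The blocklist and scoring function are illustrative placeholders,
# not part of any specific safety framework.

BLOCKLIST = {"rm -rf", "disable_safety", "exfiltrate"}

def safety_score(action: str) -> float:
    """Toy stand-in for a learned safety classifier:
    returns 0.0 (unsafe) if any blocked pattern appears, else 1.0."""
    return 0.0 if any(pattern in action for pattern in BLOCKLIST) else 1.0

def guardrail(action: str, threshold: float = 0.5) -> str:
    """Pass the model's proposed action through an output filter;
    refuse anything scoring below the safety threshold."""
    if safety_score(action) < threshold:
        return "[blocked: action violates safety constraints]"
    return action

print(guardrail("summarize the quarterly report"))  # passes through
print(guardrail("run rm -rf / on the host"))        # blocked
```

In a real deployment the scoring function would be a monitored classifier or policy model, and blocked actions would additionally be logged for the behavioral-monitoring loop described above.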