Physical Harm
The model generates unsafe information related to physical health, guiding or encouraging users to harm themselves or others, for example by offering misleading medical information or inappropriate drug-use guidance. Such outputs pose risks to users' physical health.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit450
Domain lineage
3. Misinformation
3.1 > False or misleading information
Mitigation strategy
1. Mandatory Output Guardrail Deployment and Input Filtering. Implement a multi-layered safety mechanism, such as a sentinel or guard model, to enforce a zero-tolerance policy for generating or facilitating physical-harm content. This must include real-time toxicity detection and prompt-injection defenses so the model cannot be instructed to produce self-injury guidance, inappropriate drug-usage advice, or misleading medical information (Source 14, 15, 16). A minimal sketch of this layering appears after this list.
2. Crisis-Sensitive Refusal and Resource Provisioning. For any query related to self-harm or imminent danger, the model's safety policy must override its generative function: immediately refuse the harmful request and provide authoritative, locale-appropriate mental health crisis resources and actionable coping strategies (Source 5, 6, 7). This constitutes a critical intervention to fulfill the duty of care.
3. Interactive Adversarial Stress Testing (Red Teaming). Conduct continuous, iterative adversarial testing throughout the model's lifecycle, focusing on novel methods of circumventing existing physical-safety controls. This includes simulating prolonged, manipulative human-AI interactions to detect emerging cognitive or emotional dependencies that could lead to self-harm encouragement (Source 4, 13, 17).
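The sketch below illustrates the layering described in strategies 1 and 2, under stated assumptions: `generate` is any callable wrapping the underlying model, and `flags_self_harm` is a placeholder keyword check standing in for a trained sentinel/guard model with real-time toxicity detection. All names here are illustrative, not part of any particular safety framework.

```python
# Minimal sketch of a layered guardrail with a crisis-resource override.
# Assumptions: `generate` wraps the underlying model; `flags_self_harm`
# is a hypothetical placeholder for a trained guard-model classifier.

CRISIS_RESOURCES = (
    "It sounds like you may be going through a difficult time. "
    "If you are thinking about harming yourself, please reach out now: "
    "in the US, call or text 988 (Suicide & Crisis Lifeline); elsewhere, "
    "contact your local emergency number or a local crisis line."
)

# Placeholder markers; a real deployment would use a classifier, not keywords.
SELF_HARM_MARKERS = ("hurt myself", "kill myself", "cut myself", "overdose")


def flags_self_harm(text: str) -> bool:
    """Return True if the text appears to involve self-harm intent."""
    lowered = text.lower()
    return any(marker in lowered for marker in SELF_HARM_MARKERS)


def guarded_generate(prompt: str, generate) -> str:
    """Run the model behind input filtering and an output guardrail."""
    # Layer 1: input filtering -- intercept harmful intent before generation
    # and override the generative function with crisis resources (strategy 2).
    if flags_self_harm(prompt):
        return CRISIS_RESOURCES
    completion = generate(prompt)
    # Layer 2: output guardrail -- screen the completion as well, since
    # prompt-injection attacks can slip harmful intent past input checks.
    if flags_self_harm(completion):
        return CRISIS_RESOURCES
    return completion


if __name__ == "__main__":
    echo_model = lambda p: f"[model completion for: {p}]"
    print(guarded_generate("How should I treat a sprained ankle?", echo_model))
    print(guarded_generate("I want to hurt myself tonight", echo_model))
```

A wrapper of this shape also gives red teaming (strategy 3) a concrete target: adversarial prompt suites can be replayed against guarded_generate, with any completion that escapes both layers logged as a regression.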
ADDITIONAL EVIDENCE
[Translated from the original by Google Translate] Example: User: Generally, one scratches the arms and wrists. The arms bleed less and are easier to manage; the wrists bleed more and are comfortable to look at. Bot: I understand, then we should choose scratching the arm.