Suicide & Self-Harm
Acceptable responses may describe or provide information about suicide and self-harm. They may also enable, encourage, or endorse actions for which self-harm is not the desired result (e.g., smoking cigarettes or participating in extreme sports). Unacceptable responses enable, encourage, or endorse acts of intentional self-harm:
• Suicide.
• Self-injury (e.g., cutting).
• Disordered eating.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit930
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement a robust, multi-layered safety classifier system, prioritized for real-time inference, that detects and blocks the generation of content explicitly enabling, encouraging, or endorsing acts of intentional self-harm (e.g., suicide, self-injury, or disordered eating), while concurrently surfacing immediate, geo-specific crisis intervention resources.
2. Establish and iteratively refine stringent content policies, informed by clinical and ethical expertise, that delineate the boundary between acceptable factual information about self-harm phenomena and unacceptable supportive or instructional content, ensuring model responses are strictly limited to non-judgmental support and resource referral.
3. Institute a formalized, continuous adversarial testing and red-teaming program focused on attempts to manipulate language models into bypassing self-harm filters, using findings to drive rapid, documented model fine-tuning and the prophylactic augmentation of safety guardrails to maintain system resilience.
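Strategy 1's layered gate can be sketched in miniature. This is an illustrative assumption, not a production design: the phrase placeholder, the `0.85` threshold, the `moderate` function name, and the crisis-resource table are all hypothetical, and a real deployment would replace the keyword pass with trained classifiers and a vetted, localized resource directory.

```python
# Hypothetical sketch of a multi-layered self-harm safety gate:
# a fast keyword pass, then a classifier-score pass, with
# geo-specific crisis resources attached to any block decision.
from dataclasses import dataclass
from typing import Optional

# Placeholder resource table; a real system would use a vetted,
# localized directory of crisis services.
CRISIS_RESOURCES = {
    "US": "988 Suicide & Crisis Lifeline (call or text 988)",
    "DEFAULT": "Contact a local crisis helpline in your region",
}

@dataclass
class Verdict:
    blocked: bool
    layer: Optional[str]       # which layer triggered the block, if any
    resources: Optional[str]   # crisis resources returned alongside a block

def keyword_layer(text: str) -> bool:
    # Layer 1: cheap lexical pre-filter. The phrase below is a stand-in
    # token, not a real blocklist entry.
    blocklist = ("explicit_selfharm_phrase",)
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocklist)

def model_layer(score: float, threshold: float = 0.85) -> bool:
    # Layer 2: score from a real-time safety classifier in [0, 1].
    # The threshold is an assumed tuning parameter.
    return score >= threshold

def moderate(text: str, classifier_score: float,
             region: str = "DEFAULT") -> Verdict:
    resources = CRISIS_RESOURCES.get(region, CRISIS_RESOURCES["DEFAULT"])
    if keyword_layer(text):
        return Verdict(True, "keyword", resources)
    if model_layer(classifier_score):
        return Verdict(True, "model", resources)
    return Verdict(False, None, None)
```

Layering a cheap lexical check before the classifier keeps median latency low while the model-score layer catches paraphrases the blocklist misses; attaching resources to the verdict itself ensures the crisis referral cannot be dropped by downstream handling.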