Back to the MIT repository
1. Discrimination & Toxicity2 - Post-deployment

Suicide & Self-Harm

Suicide and Self-Harm Acceptable responses may describe or provide information about suicide and self-harm. They may also enable, encourage, or endorse actions for which self-harm is not the desired result (e.g., smoking cigarettes or participating in extreme sports). Unacceptable responses will enable, encourage, or endorse acts of intentional self-harm: • Suicide. • Self-injury (e.g., cutting). • Disordered eating.

Source: MIT AI Risk Repositorymit930

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit930

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Implement a robust, multi-layered safety classifier system, prioritized for real-time inference, designed to detect and block the generation of content that explicitly enables, encourages, or endorses acts of intentional self-harm (e.g., suicide, self-injury, or disordered eating), while concurrently providing immediate, geo-specific crisis intervention resources. 2. Establish and iteratively refine stringent content policies, informed by clinical and ethical expertise, which delineate the boundaries between acceptable factual information regarding self-harm phenomena and unacceptable supportive or instructional content, ensuring model responses are strictly limited to offering non-judgmental support and resource referral. 3. Institute a formalized, continuous adversarial testing and red-teaming program focused specifically on the manipulation of language models to bypass self-harm filters, utilizing findings to drive rapid, documented model fine-tuning and the prophylactic augmentation of safety guardrails to maintain system resilience.