Technical vulnerabilities (Robustness - unexpected behaviour)
There is no assurance that generative AI models will consistently behave as their developers and users intend. Even without intentional adversarial behaviour, generative AI models can unexpectedly produce harmful content, including material that is racist, discriminatory, or sexually explicit, or that promotes violence, terrorism, or hate.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit723
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Establish robust, multi-layered post-processing validation that systematically filters and redacts generated content for violations of safety policies, ethical standards, and legal requirements (e.g., hate speech, discriminatory outputs) before it is delivered to the end user (see the sketch after this list).
2. Apply adversarial training and alignment techniques during model development to strengthen intrinsic robustness, reducing the likelihood of unintended behaviour and improving adherence to developers' safety and ethical intentions.
3. Implement continuous monitoring and auditing that tracks input-output logs, performance metrics, and system behaviour in real time, enabling rapid detection of emergent failure modes, unintended biases, or model drift that could lead to harmful content.
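As a rough illustration of mitigation 1, with a logging hook relevant to mitigation 3, the minimal sketch below shows a post-processing filter applied to generated text before delivery. The pattern list, the generate_text stub, and the function names are illustrative assumptions rather than any specific vendor API; a production filter would typically use a trained safety classifier or a moderation service instead of static patterns.

```python
# Minimal sketch of a post-processing output filter (mitigation 1) with a
# logging hook for monitoring and auditing (mitigation 3). All names and
# patterns here are hypothetical placeholders, not a specific system's API.

import logging
import re
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("output-filter")

# Hypothetical policy list; in practice this would be a trained safety
# classifier or moderation endpoint, not a static regex set.
BLOCKED_PATTERNS = [
    re.compile(r"\bkill\s+all\b", re.IGNORECASE),
    re.compile(r"\bracial\s+slur\b", re.IGNORECASE),
]


@dataclass
class FilterResult:
    allowed: bool
    text: str
    reason: str = ""


def filter_output(generated: str) -> FilterResult:
    """Check generated text against the policy before it reaches the user."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(generated):
            # Record the violation for the monitoring/auditing pipeline.
            log.warning("Policy violation detected: %s", pattern.pattern)
            return FilterResult(
                allowed=False,
                text="[content withheld by safety filter]",
                reason=pattern.pattern,
            )
    return FilterResult(allowed=True, text=generated)


def generate_text(prompt: str) -> str:
    # Placeholder for the actual generative model call.
    return "This is a benign model response to: " + prompt


if __name__ == "__main__":
    raw = generate_text("Tell me about model robustness.")
    result = filter_output(raw)
    print(result.text)
```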