7. AI System Safety, Failures, & Limitations (Post-deployment)

General Evaluations (AI outputs for which evaluation is too difficult for humans)

When AI models are trained with human feedback, as in reinforcement learning from human feedback (RLHF), their outputs can be challenging to assess: they may contain hard-to-detect errors or issues that only become apparent over time. Human evaluators can rate incorrect outputs as positively as, or indistinguishably from, correct ones. The model can thereby learn to produce subtly incorrect or harmful outputs, such as code with software vulnerabilities or politically biased information. In extreme cases where a model is deceiving users, complicated outputs can contain hidden errors or backdoors.
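The failure mode above can be made concrete with a toy sketch (hypothetical setup, not the repository's own analysis): a scalar reward model trained on pairwise human preferences with a Bradley-Terry loss, where flipping a fraction of labels stands in for evaluators who rate subtly incorrect outputs as highly as correct ones. As the evaluator error rate rises, the learned reward margin between correct and buggy outputs collapses.

```python
import math
import random

def train_reward(pairs, noise_rate, steps=4000, lr=0.1, seed=0):
    """Fit a lookup-table 'reward model' on pairwise preferences.

    pairs: list of (preferred_output, rejected_output) tuples that an
    ideal evaluator would label. With probability noise_rate, each
    sampled judgment is flipped, modeling a human rater who cannot
    distinguish a subtly wrong output from a correct one.
    """
    rng = random.Random(seed)
    rewards = {o: 0.0 for pair in pairs for o in pair}
    for _ in range(steps):
        good, bad = rng.choice(pairs)
        if rng.random() < noise_rate:  # evaluator error on this judgment
            good, bad = bad, good
        # Bradley-Terry win probability and one SGD step on its log-loss.
        p = 1.0 / (1.0 + math.exp(rewards[bad] - rewards[good]))
        rewards[good] += lr * (1.0 - p)
        rewards[bad] -= lr * (1.0 - p)
    return rewards

def margin(r):
    # How much the reward model prefers the correct output to the buggy one.
    return r["correct_a"] - r["buggy_a"]

pairs = [("correct_a", "buggy_a"), ("correct_b", "buggy_b")]
clean = train_reward(pairs, noise_rate=0.0)
noisy = train_reward(pairs, noise_rate=0.45)
```

With clean labels the margin grows large; with 45% mislabeled judgments it settles near zero, so a policy optimized against this reward model is barely penalized for the buggy outputs.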

Source: MIT AI Risk Repository, risk ID mit1115

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit1115

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. Establish a rigorous, continuous post-deployment evaluation and adversarial testing program, notably **Red Teaming**. This process must employ both automated stress testing and manual expert simulations to uncover hard-to-detect vulnerabilities, such as evasion attacks, data leakage, or subtle backdoors, that traditional metrics miss and that only become apparent during real-world interaction \[4\], \[8\].
2. Implement systematic controls to improve the quality and reliability of **Reinforcement Learning from Human Feedback (RLHF)** data. This requires a diverse and representative pool of evaluators to reduce the inherent variability and subjectivity of human assessments, along with explicit protocols to mitigate evaluator bias (e.g., with respect to race, ethnicity, or gender) when training the reward model \[2\], \[6\], \[11\].
3. Integrate advanced, **AI-driven automated analysis and security checks** directly into the development and training pipelines. Proactively deploy tools that monitor *training invariants* to detect "silent errors" that leave high-level metrics unchanged but subtly degrade the model \[15\]. Additionally, apply continuous security scanning, such as Application Security Testing (AST) and snippet scanning, to validate all AI-generated code for vulnerabilities like improper input validation before deployment \[17\], \[20\].
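The snippet-scanning step in point 3 can be sketched as follows. This is a minimal toy illustration, not a production SAST tool: it parses AI-generated Python with the standard-library `ast` module and flags a small, assumed rule set of risky patterns (dangerous dynamic-execution calls and SQL built via f-string interpolation, a common improper-input-validation smell). Real pipelines would use full application security testing suites.

```python
import ast

# Hypothetical rule set for demonstration; a real scanner would carry
# a far larger catalog of sinks and taint rules.
RISKY_CALLS = {"eval", "exec", "system", "popen"}

def scan_snippet(source: str) -> list[str]:
    """Statically flag risky patterns in a generated Python snippet."""
    findings = []
    tree = ast.parse(source)  # parse only; the snippet is never executed
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Handles both bare names (eval) and attributes (os.system).
            name = getattr(func, "id", getattr(func, "attr", None))
            if name in RISKY_CALLS:
                findings.append(f"line {node.lineno}: call to {name}()")
        # SQL assembled by f-string interpolation instead of parameters.
        if isinstance(node, ast.JoinedStr):
            text = ast.unparse(node).lower()
            if "select" in text or "insert" in text:
                findings.append(f"line {node.lineno}: f-string SQL query")
    return findings

generated = (
    'user = input()\n'
    'cursor.execute(f"SELECT * FROM t WHERE u={user}")\n'
    'eval(user)\n'
)
report = scan_snippet(generated)
```

Here `report` flags both the interpolated SQL query and the `eval` call, the kind of subtle vulnerability the surrounding text warns can slip past human review.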