7. AI System Safety, Failures, & Limitations3 - Other

Robustness

These evaluations assess the quality, stability, and reliability of a LLM's performance when faced with unexpected, out-of-distribution or adversarial inputs. Robustness evaluation is essential in ensuring that a LLM is suitable for real-world applications by assessing its resilience to various perturbations.

Source: MIT AI Risk Repositorymit651

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit651

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. **Adversarial Training and Alignment for Robustness**: Augment the model's training and fine-tuning datasets with deliberately constructed adversarial inputs and perturbations to iteratively enhance its resilience against sophisticated manipulation attempts (e.g., jailbreaks and prompt injection), thereby integrating robustness directly into the model's core capabilities. 2. **Multi-Layered Input and Output Validation Pipelines**: Deploy a comprehensive, real-time runtime defense framework featuring both pre-processing input filtering (e.g., normalization, sanitization, and classification against known adversarial patterns) and post-processing output guardrails (e.g., LLM-as-a-Judge verification) to detect and neutralize malicious instructions or unsafe content before the model is compromised or an unaligned response is released. 3. **Domain-Agnostic Feature Learning**: Implement data-driven and architectural strategies such as Domain Randomization and Semantic Rewriting to expose the model to a wider array of stylistic and distributional shifts during training, ensuring that the model learns generalizable, invariant features rather than relying on brittle statistical dependencies or spurious correlations present in the initial training data, thereby improving Out-of-Distribution (OOD) generalization.