Intelligibility
How can we build agents whose decisions we can understand? Connects explainable decisions (Berkeley) and informed oversight (MIRI).
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit833
Domain lineage
7. AI System Safety, Failures, & Limitations
7.4 > Lack of transparency or interpretability
Mitigation strategy
1. Prioritize the development and deployment of intrinsically interpretable AI models ("glass-box" architectures), or integrate validated post-hoc explanation techniques (e.g., SHAP, LIME) into complex systems from the outset. This technical measure ensures that a human-readable, verifiable justification is available for every decision, transforming the computational "black box" into a transparent process.
2. Design and rigorously test an "informed oversight" mechanism to ensure the agent's decisions align with the true utility function. This involves creating a robust second-order process or human-in-the-loop framework capable of reliably evaluating the outcomes of the agent's actions against the intended utility, mitigating risks that arise when a highly capable agent is trained on a flawed or incomplete proxy reward signal.
3. Establish a formal governance framework that mandates systematic algorithmic transparency. This requires complete documentation of training-data provenance, model architecture, and the specific interpretability metrics used. Pre-deployment auditing must verify that these transparency disclosures are sufficient for regulatory compliance and expert debugging, ensuring accountability before the system is operationalized.
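The first strategy above can be illustrated with a minimal sketch of additive feature attribution. For a linear model, SHAP-style attributions reduce to coefficient × (feature value − baseline value), so every prediction decomposes into a per-feature justification. All names here (`explain_prediction`, the example coefficients and features) are hypothetical illustrations, not from the source.

```python
def explain_prediction(coefs, baseline, x):
    """Return (prediction, per-feature contributions) for a linear model.

    coefs: weight per feature name.
    baseline: reference feature values (e.g., population means).
    x: the input being explained.
    For linear models, each contribution is coef * (x - baseline),
    which matches the exact SHAP value for that feature.
    """
    contributions = {
        name: coefs[name] * (x[name] - baseline[name]) for name in coefs
    }
    # Expected prediction at the baseline ("base value" in SHAP terms).
    base_value = sum(coefs[n] * baseline[n] for n in coefs)
    prediction = base_value + sum(contributions.values())
    return prediction, contributions


# Hypothetical credit-scoring example.
coefs = {"income": 0.5, "debt": -0.8}
baseline = {"income": 40.0, "debt": 10.0}   # population means
applicant = {"income": 50.0, "debt": 20.0}

score, why = explain_prediction(coefs, baseline, applicant)
# score = 9.0; why = {"income": 5.0, "debt": -8.0}
```

The returned `why` dictionary is the human-readable justification the mitigation calls for: a reviewer can see that above-average income raised this applicant's score by 5 points while above-average debt lowered it by 8. Nonlinear "black-box" models require the full SHAP or LIME machinery, but the decomposition they produce has this same additive shape.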