Intelligibility
How can we build agents whose decisions we can understand? Connects explainable decisions (Berkeley) and informed oversight (MIRI).
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit833
Domain lineage
7. AI System Safety, Failures, & Limitations
7.4 > Lack of transparency or interpretability
Mitigation strategy
1. Prioritize the development and deployment of intrinsically interpretable AI models ("glass-box" architectures), or integrate validated post-hoc explanation techniques (e.g., SHAP, LIME) into complex systems from the outset. This technical measure ensures that a human-readable, verifiable justification is available for every decision, transforming the computational "black box" into a transparent process.
2. Design and rigorously test an "informed oversight" mechanism to ensure the agent's decisions align with the true utility function. This involves creating a robust second-order process or human-in-the-loop framework capable of reliably evaluating the outcomes of the agent's actions against the intended utility, mitigating risks that arise when a highly capable agent is trained on a flawed or incomplete proxy reward signal.
3. Establish a formal governance framework that mandates systematic algorithmic transparency. This requires complete documentation of training-data provenance, model architecture, and the specific interpretability metrics used. Pre-deployment auditing must verify that these transparency disclosures are sufficient for regulatory compliance and expert debugging, ensuring accountability before the system is operationalized.
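The first strategy above can be illustrated with a minimal sketch of additive feature attribution. For a linear model, SHAP-style attributions reduce to coefficient × (feature value − baseline value), so every prediction decomposes into a per-feature justification. All names here (`explain_prediction`, the example coefficients and features) are hypothetical illustrations, not from the source.

```python
def explain_prediction(coefs, baseline, x):
    """Return (prediction, per-feature contributions) for a linear model.

    coefs: weight per feature name.
    baseline: reference feature values (e.g., population means).
    x: the input being explained.
    For linear models, each contribution is coef * (x - baseline),
    which matches the exact SHAP value for that feature.
    """
    contributions = {
        name: coefs[name] * (x[name] - baseline[name]) for name in coefs
    }
    # Expected prediction at the baseline ("base value" in SHAP terms).
    base_value = sum(coefs[n] * baseline[n] for n in coefs)
    prediction = base_value + sum(contributions.values())
    return prediction, contributions


# Hypothetical credit-scoring example.
coefs = {"income": 0.5, "debt": -0.8}
baseline = {"income": 40.0, "debt": 10.0}   # population means
applicant = {"income": 50.0, "debt": 20.0}

score, why = explain_prediction(coefs, baseline, applicant)
# score = 9.0; why = {"income": 5.0, "debt": -8.0}
```

The returned `why` dictionary is the human-readable justification the mitigation calls for: a reviewer can see that above-average income raised this applicant's score by 5 points while above-average debt lowered it by 8. Nonlinear "black-box" models require the full SHAP or LIME machinery, but the decomposition they produce has this same additive shape.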