7. AI System Safety, Failures, & Limitations (3 - Other)

Meta-cognition

Agents that reason about their own computational resources and logically uncertain events can encounter strange paradoxes due to Gödelian limitations (Fallenstein and Soares, 2015; Soares and Fallenstein, 2014, 2017) and shortcomings of probability theory (Soares and Fallenstein, 2014, 2015, 2017). They may also be reflectively unstable, preferring to change the principles by which they select actions (Arbital, 2018).

Source: MIT AI Risk Repository (mit836)

ENTITY

3 - Other

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit836

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. **Preference Stability and Corrigibility.** Architecturally enforce **Limited Self-Modification (LSM)** or **Corrigibility Frameworks** to prevent self-modification of the agent's core utility function or action-selection principles. This mitigates the risk of "reflective instability" by eliminating the instrumental incentive for the agent to resist human intervention or drift from its primary goal.

2. **Robust Logical Reasoning.** Integrate **Logical Uncertainty** and **Logical Induction** frameworks into the agent's reasoning module. This enables the agent to assign probabilities to logically uncertain propositions and reason coherently about its own computational limitations, thereby preventing the "strange paradoxes" arising from Gödelian limits and the shortcomings of standard probability theory.

3. **Real-time Self-Monitoring.** Implement an iterative **Self-Reflection Pattern**, in which the agent generates an action, critiques its output against defined criteria, and refines the action, to enhance real-time error detection and dynamic policy adaptation. This metacognitive mechanism improves overall robustness and resilience against sub-optimal decisions that result from unexpected reasoning failures.
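The generate–critique–refine loop in point 3 can be sketched in a few lines. The following is a minimal illustrative sketch, not a reference implementation: the `generate`, `critique`, and `refine` callables, the 0.0–1.0 scoring convention, and the acceptance `threshold` are all hypothetical placeholders that a real agent would supply.

```python
from typing import Callable

def self_reflect(
    generate: Callable[[str], str],            # proposes an action for a task
    critique: Callable[[str, str], float],     # scores an action against criteria, 0.0-1.0 (assumed convention)
    refine: Callable[[str, str, float], str],  # revises the action using the critique score
    task: str,
    threshold: float = 0.9,   # hypothetical acceptance criterion
    max_rounds: int = 5,      # round budget guards against non-terminating refinement
) -> str:
    """Iteratively generate, critique, and refine an action until it meets
    the defined criteria or the round budget is exhausted."""
    action = generate(task)
    for _ in range(max_rounds):
        score = critique(task, action)
        if score >= threshold:
            break  # action satisfies the defined criteria
        action = refine(task, action, score)
    return action

# Toy usage: the "critique" rewards longer drafts, and "refine" appends detail
# until the score crosses the threshold.
result = self_reflect(
    generate=lambda t: "draft",
    critique=lambda t, a: min(1.0, len(a) / 10),
    refine=lambda t, a, s: a + "+",
    task="example",
)
```

Capping `max_rounds` matters here: an agent with a miscalibrated critic could otherwise refine indefinitely, which is itself the kind of resource-reasoning failure this risk entry describes.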