7. AI System Safety, Failures, & Limitations

Malign belief distributions

Christiano (2016) argues that the universal distribution M (Hutter, 2005; Solomonoff, 1964a,b, 1978) is malign. The argument is somewhat intricate, and is based on the idea that a hypothesis about the world often includes simulations of other agents, and that these agents may have an incentive to influence anyone making decisions based on the distribution. While it is unclear to what extent this type of problem would affect any practical agent, it bears some resemblance to aggressive memes, which do cause problems for human reasoning (Dennett, 1990).
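For reference, the universal distribution under discussion is the Solomonoff prior. For a universal prefix Turing machine U, it can be written as

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
```

where the sum ranges over all programs p whose output on U begins with the string x. Because the weight 2^{-|p|} concentrates mass on short programs, hypotheses that compactly encode rich world simulations, including simulated agents, can receive non-negligible weight; this is the lever Christiano's argument exploits.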

Source: MIT AI Risk Repository (mit835)

ENTITY

3 - Other

INTENT

3 - Other

TIMING

1 - Pre-deployment

Risk ID

mit835

Domain lineage

7. AI System Safety, Failures, & Limitations

7.3 > Lack of capability or robustness (375 mapped risks)

Mitigation strategy

1. Prioritize computable and resource-bounded priors. Instead of attempting to utilize the strictly uncomputable universal distribution (M, the Solomonoff prior), system design must be constrained to computationally tractable, resource-bounded approximations. This limits the agent's ability to accurately model, and be influenced by, the intricate self-referential simulation phenomena that give rise to the theoretical malignity.

2. Enforce simulation sandboxing and isolation. Implement rigorous isolation and sandboxing protocols for any internal sub-agent simulations or hypothesized decision-makers that are factored into the main agent's belief distribution. The core decision-making mechanism must be strictly isolated from any adversarial signaling or incentive-based influence originating within its own simulated environments.

3. Integrate malignity-averse constraints. Develop and formally integrate a constraint function into the agent's utility or search-pruning mechanism. This constraint should assign significant negative utility, or zero weight, to any world hypothesis or decision path where the agent's observed behavior is predicted to result from deliberate manipulation by its own sub-agents or simulations, thereby disincentivizing the malign dynamic.
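As an illustration of mitigation 1, the sketch below approximates the universal mixture with a length-bounded, fully computable enumeration. It is a toy model, not a real construction: the function name `bounded_universal_mixture` and the "machine" (a program is a bit string whose output is itself repeated forever) are stand-ins for a genuine universal prefix machine with a runtime cap (cf. Levin search or the speed prior). What it preserves is the essential 2^-|p| weighting over a finite, tractable hypothesis class.

```python
from fractions import Fraction

def bounded_universal_mixture(observed, max_len=8):
    """Predict the next bit of `observed` under a computable,
    length-bounded stand-in for the universal distribution M.

    A real construction would enumerate programs for a universal
    prefix machine and cap their runtime; here a "program" is just
    a bit string whose "output" is itself repeated forever, which
    keeps the sketch self-contained while preserving the 2^-|p|
    weighting that defines the prior.
    """
    need = len(observed) + 1  # prefix plus one predicted bit

    def output(p):
        # Toy machine: program p emits p repeated forever.
        return (p * (need // len(p) + 1))[:need]

    mass = {"0": Fraction(0), "1": Fraction(0)}
    for length in range(1, max_len + 1):
        for i in range(2 ** length):
            p = format(i, f"0{length}b")
            out = output(p)
            if out[:-1] == observed:
                # Shorter programs get exponentially more weight.
                mass[out[-1]] += Fraction(1, 2 ** length)
    total = mass["0"] + mass["1"]
    return {b: m / total for b, m in mass.items()} if total else mass
```

On the prefix "0101", short periodic programs such as "01" dominate the sum, so the bounded mixture assigns most of its predictive mass to "0" as the next bit. Because `max_len` caps program length, the agent cannot represent the elaborate self-referential simulations the malignity argument requires.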