Proxy misspecification
AI agents are directed by goals and objectives, and creating general-purpose objectives that fully capture human values may be challenging. Because goal-directed AI systems need measurable objectives, by default our systems may pursue simplified proxies of human values. The result could be suboptimal, or even catastrophic, if a sufficiently powerful AI successfully optimizes its flawed objective to an extreme degree.
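The over-optimization dynamic described above (Goodhart's law) can be illustrated with a minimal numerical sketch. The objective functions, the candidate action set, and the coupling between helpfulness and side effects are all illustrative assumptions, not part of the source:

```python
import numpy as np

def true_value(x):
    # True (unmeasurable) objective: reward helpfulness, penalize side effects.
    helpfulness, side_effects = x
    return helpfulness - 2.0 * side_effects

def proxy_value(x):
    # Measurable proxy: observes only helpfulness.
    helpfulness, _ = x
    return helpfulness

# Candidate actions: beyond helpfulness 1.0, extra helpfulness comes
# bundled with disproportionate side effects, which the proxy cannot see.
candidates = [(h, max(0.0, h - 1.0) ** 2) for h in np.linspace(0.0, 3.0, 31)]

mild = max(candidates[:11], key=proxy_value)  # weak optimization pressure
hard = max(candidates, key=proxy_value)       # strong optimization pressure

# Proxy score improves (1.0 -> 3.0) while true value collapses (1.0 -> -5.0).
print(true_value(mild), true_value(hard))
```

Under weak pressure the optimizer stays in the region where the proxy tracks true value; under strong pressure it exploits the gap between them.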
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit572
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Reduce the misspecification gap ($\epsilon$): Implement richer supervisory mechanisms, constitutional principles, or diverse, aggregated feedback loops to minimize the disparity between the proxy reward function and the true, complex human values the system is intended to maximize.
2. Moderate optimization pressure ($P$): Apply techniques such as entropy regularization or multi-objective reward balancing to temper the optimization process and prevent powerful AI systems from pursuing the flawed proxy objective to an extreme or runaway degree, a phenomenon predicted by Goodhart's law.
3. Improve robustness and constrain the policy: Utilize conservative search methods (e.g., $\delta$-Conservative Search) or train robust surrogate models with regularization against adversarial inputs to constrain policy exploration to regions where the proxy model's evaluations are demonstrably reliable, thereby mitigating misspecification on out-of-distribution inputs.
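Strategies 2 and 3 can be sketched together as a regularized selection rule: a KL-style penalty toward a reference policy discourages the optimizer from drifting onto out-of-distribution actions that the proxy overrates (cf. the KL term in RLHF-style training). The proxy scores, true values, reference distribution, and penalty weight `lam` below are illustrative assumptions:

```python
import numpy as np

# Five candidate actions: proxy and true values agree on the first four;
# the proxy badly overrates the last, out-of-distribution action.
proxy  = np.array([0.0, 0.5, 1.0, 1.2, 5.0])
true_v = np.array([0.0, 0.5, 1.0, 1.2, -3.0])

# Reference policy: near-uniform over familiar actions, tiny mass on the
# out-of-distribution action where the proxy is unreliable.
pi_ref = np.array([0.24, 0.24, 0.24, 0.24, 0.04])

def pick(lam):
    # Penalty lam * (-log pi_ref(a)) keeps selection close to the
    # reference distribution; lam = 0 is unconstrained proxy argmax.
    scores = proxy + lam * np.log(pi_ref)
    return int(np.argmax(scores))

# Unregularized optimization picks the overrated action (true value -3.0);
# the regularized pick stays on a reliably evaluated action (true value 1.2).
print(pick(0.0), pick(3.0))
```

The penalty weight plays the role of the optimization-pressure dial: lowering it recovers unconstrained proxy maximization, raising it confines exploration to regions where the proxy's evaluations are trusted.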