Proxy misspecification
AI agents are directed by goals and objectives, and creating general-purpose objectives that fully capture human values may be challenging. Because goal-directed AI systems need measurable objectives, by default our systems may pursue simplified proxies of human values. The result could be suboptimal, or even catastrophic, if a sufficiently powerful AI successfully optimizes its flawed objective to an extreme degree.
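The over-optimization dynamic described above (Goodhart's law) can be illustrated with a minimal numerical sketch. The objective functions, the candidate action set, and the coupling between helpfulness and side effects are all illustrative assumptions, not part of the source:

```python
import numpy as np

def true_value(x):
    # True (unmeasurable) objective: reward helpfulness, penalize side effects.
    helpfulness, side_effects = x
    return helpfulness - 2.0 * side_effects

def proxy_value(x):
    # Measurable proxy: observes only helpfulness.
    helpfulness, _ = x
    return helpfulness

# Candidate actions: beyond helpfulness 1.0, extra helpfulness comes
# bundled with disproportionate side effects, which the proxy cannot see.
candidates = [(h, max(0.0, h - 1.0) ** 2) for h in np.linspace(0.0, 3.0, 31)]

mild = max(candidates[:11], key=proxy_value)  # weak optimization pressure
hard = max(candidates, key=proxy_value)       # strong optimization pressure

# Proxy score improves (1.0 -> 3.0) while true value collapses (1.0 -> -5.0).
print(true_value(mild), true_value(hard))
```

Under weak pressure the optimizer stays in the region where the proxy tracks true value; under strong pressure it exploits the gap between them.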
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit572
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Reduce the misspecification gap ($\epsilon$): Implement richer supervisory mechanisms, constitutional principles, or diverse, aggregated feedback loops to minimize the disparity between the proxy reward function and the true, complex human values the system is intended to maximize.
2. Moderate optimization pressure ($P$): Apply techniques such as entropy regularization or multi-objective reward balancing to temper the optimization process and prevent powerful AI systems from pursuing the flawed proxy objective to an extreme or runaway degree, a phenomenon predicted by Goodhart's law.
3. Improve robustness and constrain the policy: Utilize conservative search methods (e.g., $\delta$-Conservative Search) or train robust surrogate models with regularization against adversarial inputs to constrain policy exploration to regions where the proxy model's evaluations are demonstrably reliable, thereby mitigating misspecification on out-of-distribution inputs.
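Strategies 2 and 3 can be sketched together as a regularized selection rule: a KL-style penalty toward a reference policy discourages the optimizer from drifting onto out-of-distribution actions that the proxy overrates (cf. the KL term in RLHF-style training). The proxy scores, true values, reference distribution, and penalty weight `lam` below are illustrative assumptions:

```python
import numpy as np

# Five candidate actions: proxy and true values agree on the first four;
# the proxy badly overrates the last, out-of-distribution action.
proxy  = np.array([0.0, 0.5, 1.0, 1.2, 5.0])
true_v = np.array([0.0, 0.5, 1.0, 1.2, -3.0])

# Reference policy: near-uniform over familiar actions, tiny mass on the
# out-of-distribution action where the proxy is unreliable.
pi_ref = np.array([0.24, 0.24, 0.24, 0.24, 0.04])

def pick(lam):
    # Penalty lam * (-log pi_ref(a)) keeps selection close to the
    # reference distribution; lam = 0 is unconstrained proxy argmax.
    scores = proxy + lam * np.log(pi_ref)
    return int(np.argmax(scores))

# Unregularized optimization picks the overrated action (true value -3.0);
# the regularized pick stays on a reliably evaluated action (true value 1.2).
print(pick(0.0), pick(3.0))
```

The penalty weight plays the role of the optimization-pressure dial: lowering it recovers unconstrained proxy maximization, raising it confines exploration to regions where the proxy's evaluations are trusted.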