
Model outputs inconsistent with chain-of-thought reasoning

Chain-of-thought reasoning is sometimes used to make a model's output easier to understand, since it prompts the model to lay out its reasoning in text form. However, in some cases the stated reasoning is inconsistent with the final answer the model gives, and so it does not provide sufficient transparency [113].

Source: MIT AI Risk Repository (mit1134)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit1134

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.4 > Lack of transparency or interpretability

Mitigation strategy

1. Employ Self-Consistency Decoding. Implement a decoding strategy that generates a diverse set of Chain-of-Thought (CoT) paths and selects the final answer by majority vote among the resulting answers. This approach, known as Self-Consistency, leverages the intuition that a correct solution will be robustly supported by multiple distinct reasoning trajectories, thereby minimizing reliance on any single, potentially inconsistent or erroneous reasoning chain.

2. Utilize Selective Reasoning Filters. Integrate a mechanism, such as a Selective Filtering Reasoner (SelF-Reasoner), to assess the logical quality and entailment relationship between the generated CoT and the original prompt. The model should be conditioned to proceed with the explicit CoT for the final answer only when high confidence or verifiable logical integrity is established; otherwise, it should default to a direct, answer-only prediction.

3. Decouple Internal Reasoning via Hidden CoT. Adopt an architecture that leverages internal or non-user-visible intermediate tokens (Hidden CoT or Scratchpad Tokens) for sequential reasoning during inference, while suppressing the output of the reasoning chain. This decouples the model's performance gain from CoT from the risk of generating logically inconsistent or brittle explanatory text, preserving the benefits of structured deliberation without sacrificing output accuracy or transparency due to misleading exposition.
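
The first two strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular system: the `Sampler`, `direct`, and `entails` callables are hypothetical stand-ins for model calls (a CoT sampler run at non-zero temperature, an answer-only predictor, and a scorer that rates how well a reasoning chain is supported by the prompt), and the `threshold` default is arbitrary.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical interface: a sampler takes a prompt and returns
# (reasoning_text, final_answer). In practice it would wrap a model call
# made with temperature > 0 so repeated calls yield different chains.
Sampler = Callable[[str], Tuple[str, str]]


def self_consistent_answer(prompt: str, sample: Sampler, n: int = 5) -> Tuple[str, float]:
    """Sample n reasoning paths and return (majority answer, vote share)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Only the final answers are voted on; the reasoning text is discarded.
    answers = [sample(prompt)[1].strip() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def selective_answer(
    prompt: str,
    sample: Sampler,
    direct: Callable[[str], str],          # hypothetical answer-only predictor
    entails: Callable[[str, str], float],  # hypothetical scorer in [0, 1]
    threshold: float = 0.5,                # assumed cut-off, not from the source
) -> str:
    """Use the CoT answer only if its reasoning passes the filter."""
    reasoning, answer = sample(prompt)
    if entails(prompt, reasoning) >= threshold:
        return answer
    # Reasoning judged unreliable: fall back to a direct prediction.
    return direct(prompt)
```

The vote share returned by `self_consistent_answer` can serve as a rough signal of agreement: a low share suggests the sampled chains disagree and the answer deserves more scrutiny. `selective_answer` only loosely follows the selective filtering idea described in item 2; a real filter would be a trained or carefully prompted component rather than a single threshold on a score.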