Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs
- URL: http://arxiv.org/abs/2601.06423v1
- Date: Sat, 10 Jan 2026 04:20:00 GMT
- Title: Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs
- Authors: Deep Mehta,
- Abstract summary: Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks.<n>We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness?<n>We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks. The approach is straightforward: generate multiple reasoning paths and select the most common answer through majority voting. While this reliably boosts accuracy, it remains unclear whether these gains reflect genuine improvements in reasoning quality. We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness? We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems. Our analysis employs bootstrap confidence intervals, McNemar's tests for paired comparisons, and Cohen's d effect sizes to quantify the effects rigorously. The results reveal striking differences across models that challenge common assumptions about self-consistency. GPT-5.2 shows the expected pattern: accuracy improves from 78% to 90% at N=5, with faithfulness remaining relatively stable (0.540 to 0.510). Claude Opus 4.5 tells a completely different story. Its accuracy actually drops from 78% to 74.3% while faithfulness jumps dramatically from 0.270 to 0.891 at N=5. DeepSeek-v3.2, already at 98% accuracy, shows ceiling effects with modest faithfulness gains (0.440 to 0.541). Gemini-3-flash improves from 81% to 86% accuracy with a slight faithfulness decrease (0.260 to 0.212). Problem difficulty analysis reveals that GPT-5.2 solves 82% of hard problems while breaking only 13% of easy ones. Claude, in contrast, breaks 23% of easy problems, explaining its accuracy decrease. These findings matter for practitioners: self-consistency is not universally beneficial, and teams should test their specific models before deployment. We release our code and provide practical recommendations for navigating these tradeoffs.
Related papers
- When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning [16.505918019260964]
We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable predictions.<n>We show that 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways.
arXiv Detail & Related papers (2026-03-03T19:43:36Z) - Do Large Language Models Know What They Don't Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets [0.0]
A well-calibrated model should express confidence that matches its actual accuracy -- when it claims 80% confidence, it should be correct 80% of the time.<n>We introduce textbfKalshiBench, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange.<n>We evaluate five frontier models -- Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2 -- and find textbfsystematic overconfidence across all models.
arXiv Detail & Related papers (2025-12-17T23:23:06Z) - Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models, including OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash.<n>Our evaluation reveals significant accuracy drops following gaslighting negation prompts.<n>We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' susceptibility to defend their belief.
arXiv Detail & Related papers (2025-06-11T12:52:25Z) - VeriThinker: Learning to Verify Makes Reasoning Model Efficient [52.74493506816969]
Large Reasoning Models excel at complex tasks using Chain-of-Thought (CoT) reasoning.<n>Their tendency to overthinking leads to unnecessarily lengthy reasoning chains.<n>We introduce VeriThinker, a novel approach for CoT compression.
arXiv Detail & Related papers (2025-05-23T14:17:56Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.<n>These findings highlight the reliance on recall over rigorous logical inference.<n>This paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Accurate and Reliable Predictions with Mutual-Transport Ensemble [46.368395985214875]
We propose a co-trained auxiliary model and adaptively regularizes the cross-entropy loss using Kullback-Leibler (KL)
We show that MTE can simultaneously enhance both accuracy and uncertainty calibration.
For example, on the CIFAR-100 dataset, our MTE method on ResNet34/50 achieved significant improvements compared to previous state-of-the-art method.
arXiv Detail & Related papers (2024-05-30T03:15:59Z) - Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes Errors [3.350980549219263]
certified training improves robustness but also decreases accuracy noticeably.
It is not clear whether there is a certain fundamental limit on achieving robustness whilst maintaining accuracy.
By adopting Bayes error to robustness analysis, we investigate the limit of certified robust accuracy.
arXiv Detail & Related papers (2024-05-19T13:23:05Z) - BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability [35.743903178120895]
BaRDa dataset contains 3000 entailments (1787 valid, 1213 invalid)
We find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2.
This shows the clear progression of models towards improved factual accuracy and entailment reasoning.
arXiv Detail & Related papers (2023-12-12T18:55:43Z) - Show Your Work with Confidence: Confidence Bands for Tuning Curves [51.12106543561089]
tuning curves plot validation performance as a function of tuning effort.
We present the first method to construct valid confidence bands for tuning curves.
We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method.
arXiv Detail & Related papers (2023-11-16T00:50:37Z) - Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z) - PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.