Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
- URL: http://arxiv.org/abs/2512.00552v1
- Date: Sat, 29 Nov 2025 16:47:01 GMT
- Title: Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
- Authors: Subramanyam Sahoo, Vinija Jain, Saanidhya Vats, Siddharth Mohapatra, Rui Min, Aman Chadha, Divya Chaudhary
- Abstract summary: We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching. We reveal a striking disconnect between surface performance and reasoning fidelity. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics.
- Score: 15.774418410083515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
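The four diagnostic axes lend themselves to simple programmatic checks. As a rough, hypothetical sketch (not the authors' released protocol), forward-backward consistency can be scored by posing each question in its original and inverted form and counting how often both answers match their references; the function and field names below are illustrative assumptions:

```python
# Hypothetical sketch of a forward-backward consistency check, one of the four
# diagnostic axes named in the abstract. Function and field names (query_model,
# "forward", "backward", "gold", "expected_backward") are placeholders, not the
# paper's released evaluation protocol.

def forward_backward_consistency(problems, query_model):
    """Return the fraction of forward-correct problems whose inverted
    (backward) restatement is also answered correctly."""
    forward_correct = 0
    backward_consistent = 0
    for p in problems:
        fwd_answer = query_model(p["forward"]).strip()
        if fwd_answer != p["gold"]:
            continue  # only score backward consistency where the forward answer is right
        forward_correct += 1
        bwd_answer = query_model(p["backward"]).strip()
        if bwd_answer == p["expected_backward"]:
            backward_consistent += 1
    return backward_consistent / forward_correct if forward_correct else 0.0


# Illustrative usage with a stubbed model standing in for a real LLM call:
if __name__ == "__main__":
    problems = [
        {
            "forward": "Event A happened 3 years before event B, which was in 2000. When was A?",
            "gold": "1997",
            "backward": "Event A was in 1997 and happened 3 years before event B. When was B?",
            "expected_backward": "2000",
        },
    ]
    score = forward_backward_consistency(
        problems, query_model=lambda q: "1997" if "When was A" in q else "2000"
    )
    print(f"backward consistency: {score:.0%}")
```

A low score on this check, paired with high forward accuracy, is exactly the disconnect the abstract reports for Qwen3-0.6B on MenatQA.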
Related papers
- Robust Weighted Triangulation of Causal Effects Under Model Uncertainty [2.1793134762413433]
We develop a framework for causal effect triangulation that combines model testability methods with statistical inference methods. We provide a bound on the distance of the functional from the true causal effect, along with conditions under which this distance can be taken to zero. Our framework formalizes robustness under causal pluralism without requiring agreement across models or commitment to a single specification.
arXiv Detail & Related papers (2026-03-01T14:09:34Z) - Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models [15.849480549367684]
We propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task. By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning traces with program-intrinsic logic. Our framework improves the reasoning F1-score by an average of 18.9% over all baselines.
arXiv Detail & Related papers (2026-02-06T13:19:45Z) - EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs [9.412828452977553]
Existing approaches reinforce successful reasoning paths, incurring a substantial calibration cost. This failure has been characterized as a form of model collapse in alignment. We propose EpiCaR as a training objective that jointly optimizes reasoning performance and calibration.
arXiv Detail & Related papers (2026-01-11T06:21:13Z) - The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models [0.0]
Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. We introduce the Drill-Down and Fabricate Test (DDFT), a protocol that measures epistemic robustness under such stress. We find flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance.
arXiv Detail & Related papers (2025-12-29T20:29:09Z) - Schoenfeld's Anatomy of Mathematical Reasoning by Language Models [56.656180566692946]
We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models). ThinkARM explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, and Verify. We show that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
arXiv Detail & Related papers (2025-12-23T02:44:25Z) - The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility [0.0]
Models achieving above-average human IQ scores simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks. This disconnect appears most strongly in the crystallized intelligence domain. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
arXiv Detail & Related papers (2025-11-23T05:49:57Z) - From Black-box to Causal-box: Towards Building More Interpretable Models [57.23201263629627]
We introduce the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a specific class of models. We derive a complete graphical criterion that determines whether a given model architecture supports a given counterfactual query.
arXiv Detail & Related papers (2025-10-24T20:03:18Z) - Systematic Diagnosis of Brittle Reasoning in Large Language Models [1.14219428942199]
A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. We propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points.
arXiv Detail & Related papers (2025-10-05T21:40:09Z) - Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models [11.250861762443801]
We introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems. We thoroughly evaluate advanced models' performance on it. Our analysis uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models.
arXiv Detail & Related papers (2025-06-20T16:14:18Z) - Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments [5.5855749614100825]
This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.
arXiv Detail & Related papers (2025-05-25T23:17:47Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - On the Reasoning Capacity of AI Models and How to Quantify It [0.0]
Large Language Models (LLMs) have intensified the debate surrounding the fundamental nature of their reasoning capabilities. While achieving high performance on benchmarks such as GPQA and MMLU, these models exhibit limitations in more complex reasoning tasks. We propose a novel phenomenological approach that goes beyond traditional accuracy metrics to probe the underlying mechanisms of model behavior.
arXiv Detail & Related papers (2025-01-23T16:58:18Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - Neural Causal Models for Counterfactual Identification and Estimation [62.30444687707919]
We study the evaluation of counterfactual statements through neural models.
First, we show that neural causal models (NCMs) are expressive enough.
Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions.
arXiv Detail & Related papers (2022-09-30T18:29:09Z) - Estimation of Bivariate Structural Causal Models by Variational Gaussian Process Regression Under Likelihoods Parametrised by Normalising Flows [74.85071867225533]
Causal mechanisms can be described by structural causal models.
One major drawback of state-of-the-art artificial intelligence is its lack of explainability.
arXiv Detail & Related papers (2021-09-06T14:52:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.