Related papers: Systematic Diagnosis of Brittle Reasoning in Large Language Models

URL: http://arxiv.org/abs/2510.08595v1
Date: Sun, 05 Oct 2025 21:40:09 GMT
Title: Systematic Diagnosis of Brittle Reasoning in Large Language Models
Authors: V. S. Raghu Parupudi
Abstract summary: A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. We propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points.
Score: 1.14219428942199
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent "reasoning modes." This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.
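The last step described in the abstract, quantifying how reliable each reasoning mode is, can be sketched as a per-mode accuracy tally. This is a minimal illustration, not the authors' code: the function name `accuracy_by_mode`, the record layout (mode label, correctness flag), and the toy records are all assumptions. In the paper the mode labels come from gpt-4o-mini clustering of reasoning sentences, which this sketch does not reproduce.

```python
from collections import defaultdict


def accuracy_by_mode(records):
    """Return {mode: fraction marked correct} for (mode, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for mode, is_correct in records:
        totals[mode] += 1
        if is_correct:
            correct[mode] += 1
    return {mode: correct[mode] / totals[mode] for mode in totals}


# Placeholder records, invented for illustration only (not data from the paper).
toy_records = [
    ("sequential_calculation", True),
    ("sequential_calculation", True),
    ("sequential_calculation", False),
    ("sequential_calculation", True),
    ("combinatorial_with_restrictions", False),
    ("combinatorial_with_restrictions", True),
]

profile = accuracy_by_mode(toy_records)
for mode, acc in sorted(profile.items()):
    print(f"{mode}: {acc:.2f}")
```

A per-mode table like this is what lets a single aggregate benchmark score be broken down into a profile of strong and weak reasoning skills.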

Related papers

Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity [15.774418410083515]
We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching. We reveal a striking disconnect between surface performance and reasoning fidelity. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics.
arXiv Detail & Related papers (2025-11-29T16:47:01Z)
Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling [60.63703438729223]
We show how different architectures and training methods affect model multi-step reasoning capabilities. We confirm that increasing model depth plays a crucial role for sequential computations.
arXiv Detail & Related papers (2025-08-22T18:57:08Z)
Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments [5.5855749614100825]
This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.
arXiv Detail & Related papers (2025-05-25T23:17:47Z)
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models [27.437685534830457]
Large language models frequently exhibit a problematic reliance on familiar reasoning patterns. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles.
arXiv Detail & Related papers (2025-05-22T19:00:01Z)
From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education [24.970741456147447]
Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. We introduce MathCCS, a benchmark designed for systematic error analysis and tailored feedback. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Third, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-
arXiv Detail & Related papers (2025-02-19T14:57:51Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model. We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
Rigorous Probabilistic Guarantees for Robust Counterfactual Explanations [80.86128012438834]
We show for the first time that computing the robustness of counterfactuals with respect to plausible model shifts is NP-complete. We propose a novel probabilistic approach which is able to provide tight estimates of robustness with strong guarantees.
arXiv Detail & Related papers (2024-07-10T09:13:11Z)
Unified Explanations in Machine Learning Models: A Perturbation Approach [0.0]
Inconsistencies between XAI and modeling techniques can have the undesirable effect of casting doubt upon the efficacy of these explainability approaches. We propose a systematic, perturbation-based analysis against a popular, model-agnostic method in XAI, SHapley Additive exPlanations (SHAP). We devise algorithms to generate relative feature importance in settings of dynamic inference amongst a suite of popular machine learning and deep learning methods, and metrics that allow us to quantify how well explanations generated under the static case hold.
arXiv Detail & Related papers (2024-05-30T16:04:35Z)
Modeling Boundedly Rational Agents with Latent Inference Budgets [56.24971011281947]
We introduce a latent inference budget model (L-IBM) that models agents' computational constraints explicitly. L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors. We show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty.
arXiv Detail & Related papers (2023-12-07T03:55:51Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
