Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
- URL: http://arxiv.org/abs/2505.23701v1
- Date: Thu, 29 May 2025 17:37:57 GMT
- Title: Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
- Authors: Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, Jackie Chi Kit Cheung
- Abstract summary: Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems. We show that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps.
- Score: 39.50584137093401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
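To make the distinction concrete, here is a minimal sketch of what such a disentangled scorer could look like, assuming each model output is parsed into an expression string plus a final numeric answer; the function names and the pen-counting example are illustrative, not the authors' evaluation code:

```python
import sympy as sp

def formulation_correct(pred_expr: str, gold_expr: str) -> bool:
    """Abstract formulation: did the model write down the right expression?
    Naive structural check on the unevaluated parse trees."""
    try:
        return sp.sympify(pred_expr, evaluate=False) == sp.sympify(gold_expr, evaluate=False)
    except (sp.SympifyError, TypeError):
        return False

def computation_correct(pred_expr: str, pred_answer: float, tol: float = 1e-6) -> bool:
    """Arithmetic computation: does the stated final answer equal the value
    of the expression the model itself produced?"""
    try:
        return abs(float(sp.sympify(pred_expr)) - pred_answer) < tol
    except (sp.SympifyError, TypeError):
        return False

# "3 boxes of 12 pens, 5 given away": right abstraction, wrong arithmetic
print(formulation_correct("3*12 - 5", "3*12 - 5"))  # True
print(computation_correct("3*12 - 5", 30.0))        # False: 3*12 - 5 = 31
```

Scoring the two sub-skills separately is what allows a failure to be attributed to the abstraction or to the arithmetic, rather than folded into a single final-answer accuracy.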
Related papers
- Computational Thinking Reasoning in Large Language Models [69.28428524878885]
Computational Thinking Model (CTM) is a novel framework that incorporates computational thinking paradigms into large language models (LLMs). Live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability.
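As a rough illustration of the "think by computing" pattern only (the `generate` callable, the `<code>` tags, and the loop structure are assumptions, not CTM's actual interface):

```python
import contextlib, io, re

def think_by_computing(generate, prompt: str, max_rounds: int = 4) -> str:
    """Toy interleaved generate-and-execute loop. `generate` is any
    callable mapping a prompt to a continuation string; the model is
    assumed to wrap runnable Python between <code> and </code> tags."""
    text = prompt
    for _ in range(max_rounds):
        chunk = generate(text)
        text += chunk
        match = re.search(r"<code>(.*?)</code>", chunk, re.S)
        if match is None:
            break                                   # no code emitted: done
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), {})                # unsafe outside a sandbox
        text += f"\n[output]\n{buffer.getvalue()}"  # feed the result back
    return text
```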
arXiv Detail & Related papers (2025-06-03T09:11:15Z)
- Thinkless: LLM Learns When to Think [57.857534644932194]
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning. On several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%.
arXiv Detail & Related papers (2025-05-19T17:24:16Z)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning [54.70811660561151]
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means to automatically evaluate whether a model can consistently produce correct final answers across varied inputs to the program.
We observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
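A toy version of the idea, with a hypothetical symbolic program standing in for the LLM-extracted ones (`make_instance`, `consistency_accuracy`, and the pen template are illustrative, not from the paper):

```python
import random

def make_instance(seed: int):
    """A hypothetical symbolic program behind one GSM8K-style template
    (ReasonAgain extracts such programs from real problems with an LLM)."""
    rng = random.Random(seed)
    boxes, pens, given = rng.randint(2, 9), rng.randint(5, 20), rng.randint(1, 9)
    question = (f"Tom has {boxes} boxes with {pens} pens each and gives "
                f"away {given} pens. How many pens does he have left?")
    return question, boxes * pens - given

def consistency_accuracy(ask_model, n: int = 20) -> float:
    """ask_model: callable question -> int answer (the LLM under test).
    A model that truly reasons should stay correct on every perturbation."""
    return sum(ask_model(q) == a for q, a in map(make_instance, range(n))) / n
```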
arXiv Detail & Related papers (2024-10-24T18:02:37Z)
- Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines [7.695524275630717]
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing and reasoning tasks.
We propose a Composable Arithmetic Execution Framework (CAEF) that enables LLMs to learn to execute step-by-step computations by emulating Turing Machines.
In our evaluation, CAEF achieves nearly 100% accuracy across seven common mathematical operations on the LLaMA 3.1-8B model.
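For intuition, stepwise execution of addition with an explicit carry state might look like the following toy trace generator (a sketch only; CAEF's learned executors are far more general than this):

```python
def add_stepwise(a: str, b: str) -> str:
    """Decimal addition unrolled into explicit per-digit steps with an
    explicit carry state, loosely mimicking a Turing-machine trace."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for i in range(width - 1, -1, -1):   # scan the "tape" right to left
        carry, d = divmod(int(a[i]) + int(b[i]) + carry, 10)
        digits.append(str(d))
        print(f"pos {i}: {a[i]}+{b[i]} -> write {d}, carry {carry}")
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_stepwise("478", "694"))  # 1172
```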
arXiv Detail & Related papers (2024-10-10T13:23:49Z)
- Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model [15.542737858152053]
We propose Key-Point-Driven Mathematical Reasoning Distillation (KPDD) to mitigate misunderstanding errors.
KPDD enhances the reasoning performance of smaller language models (SLMs) by breaking down the problem-solving process into three stages.
Experiments show KPDD-CoT significantly improves reasoning abilities, while KPDD-PoT achieves state-of-the-art performance in mathematical reasoning tasks.
arXiv Detail & Related papers (2024-07-14T11:41:03Z)
- Quantum Many-Body Physics Calculations with Large Language Models [7.679615503214482]
Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains.
We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method.
We design multi-step prompt templates that break down the analytic calculation into standardized steps.
We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade.
arXiv Detail & Related papers (2024-03-05T17:47:22Z)
- Efficient Computation of Counterfactual Bounds [44.4263314637532]
We compute exact counterfactual bounds via algorithms for credal nets on a subclass of structural causal models.
We evaluate their accuracy by providing credible intervals on the quality of the approximation.
arXiv Detail & Related papers (2023-07-17T07:59:47Z)
- Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [75.74103236299477]
Chain-of-thought (CoT) prompting and tool augmentation have been validated as effective practices for improving large language models.
We propose DELI, a new approach that deliberates over the reasoning steps with tool interfaces.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z)
- Learning Multi-Step Reasoning by Solving Arithmetic Tasks [6.398022050054328]
This work investigates how to endow relatively small language models with multi-step reasoning capabilities.
We propose to inject such abilities by continually pre-training LMs on a synthetic dataset MsAT.
Our experiments on four math word problem datasets show the effectiveness of the proposed method.
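A guess at what one MsAT-style synthetic training example could look like (the field names and the operation mix are assumptions, not the dataset's actual schema):

```python
import random

def msat_style_example(rng: random.Random) -> dict:
    """One synthetic multi-step arithmetic example in the spirit of MsAT;
    the real dataset's exact operations and format may differ."""
    a, b, c = (rng.randint(2, 99) for _ in range(3))
    x, y = a + b, (a + b) * c
    return {
        "question": f"x = {a} + {b} ; y = x * {c} ; y = ?",
        "steps": [f"x = {x}", f"y = {x} * {c} = {y}"],  # supervised CoT steps
        "answer": y,
    }

print(msat_style_example(random.Random(0)))
```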
arXiv Detail & Related papers (2023-06-02T17:29:22Z)
- Does entity abstraction help generative Transformers reason? [8.159805544989359]
We study the utility of incorporating entity type abstractions into pre-trained Transformers.
We test these methods on four NLP tasks requiring different forms of logical reasoning.
arXiv Detail & Related papers (2022-01-05T19:00:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.