Large Language Models and Mathematical Reasoning Failures
- URL: http://arxiv.org/abs/2502.11574v1
- Date: Mon, 17 Feb 2025 09:07:32 GMT
- Title: Large Language Models and Mathematical Reasoning Failures
- Authors: Johan Boye, Birger Moell
- Abstract summary: This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.
We rigorously analyze both final answers and solution steps to identify reasoning failures.
We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
- Score: 1.6114012813668932
- License:
- Abstract: This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models - including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants - we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
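The evaluation described above hinges on grading solution steps as well as final answers. Below is a minimal sketch of how such a two-level grading pipeline could be organized; the category names, data structures, and function names are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a two-level evaluation: score the final answer
# automatically, then attach a manually assigned error label to each solution.
# ERROR_CATEGORIES, GradedSolution, and grade_solution are illustrative names.
from dataclasses import dataclass

ERROR_CATEGORIES = {
    "none",                    # fully correct reasoning
    "spatial_reasoning",       # misreads the geometric / physical setup
    "strategic_planning",      # picks an unworkable solution strategy
    "arithmetic",              # slips in calculation
    "unwarranted_assumption",  # assumes facts the problem never states
    "lucky_guess",             # right answer reached through flawed logic
}

@dataclass
class GradedSolution:
    problem_id: str
    model: str
    final_answer_correct: bool
    error_category: str        # assigned by a human annotator

def grade_solution(problem_id: str, model: str, predicted: str,
                   gold: str, annotator_label: str) -> GradedSolution:
    """Combine automatic answer checking with a manual step-level judgment."""
    if annotator_label not in ERROR_CATEGORIES:
        raise ValueError(f"unknown error category: {annotator_label}")
    return GradedSolution(
        problem_id=problem_id,
        model=model,
        final_answer_correct=(predicted.strip() == gold.strip()),
        error_category=annotator_label,
    )

# A correct answer reached through flawed logic is still flagged.
record = grade_solution("p07", "gpt-4o", "42", "42", "lucky_guess")
print(record)
```

Keeping the answer check and the step-level label separate is what allows "correct answer, flawed logic" cases to be counted rather than silently marked as successes.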
Related papers
- Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges [0.0]
We introduce GSM-Ranges, a dataset generator that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales.
We also propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy.
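As a toy illustration of the numeric-perturbation idea behind GSM-Ranges (this is not the released generator; the template, ranges, and scale values below are assumptions chosen for demonstration):

```python
# Rescale the numbers in a templated word problem to probe whether a model's
# reasoning survives changes in numerical magnitude.
import random

TEMPLATE = ("A shop sells pencils in boxes of {box_size}. "
            "If a school orders {boxes} boxes, how many pencils does it receive?")

def perturbed_instances(scales=(1, 10, 100, 1000), seed=0):
    rng = random.Random(seed)
    for scale in scales:
        box_size = rng.randint(2, 9) * scale   # grow the operands with the scale
        boxes = rng.randint(2, 9)
        yield {
            "question": TEMPLATE.format(box_size=box_size, boxes=boxes),
            "answer": box_size * boxes,
            "scale": scale,
        }

for inst in perturbed_instances():
    print(inst["scale"], inst["question"], "->", inst["answer"])
```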
arXiv Detail & Related papers (2025-02-12T09:53:10Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.
JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.
Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z)
- ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
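A minimal sketch of how step-level identification could be scored, assuming the convention that a critic outputs the index of the earliest erroneous step, or -1 when every step is correct; the exact label format here is an assumption for illustration, not the benchmark's code.

```python
# Score a critic that must pinpoint the earliest erroneous step in a solution.
def score_case(predicted_index: int, gold_index: int) -> bool:
    """Credit only an exact match on the earliest error (or on 'no error', -1)."""
    return predicted_index == gold_index

def accuracy(predictions, golds):
    assert len(predictions) == len(golds)
    correct = sum(score_case(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

# Example: the critic finds the error at step 2 in one solution,
# but wrongly flags step 0 in an error-free solution.
print(accuracy([2, 0], [2, -1]))  # 0.5
```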
arXiv Detail & Related papers (2024-12-09T15:11:40Z)
- Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning [11.63133816413199]
Large Language Models (LLMs) have been applied to Math Word Problems (MWPs).
We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models.
We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models.
arXiv Detail & Related papers (2024-06-16T08:06:05Z)
- Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning [30.40415945003794]
We investigate the compositionality of large language models (LLMs) in mathematical reasoning.
Since problems with logical flaws are quite rare in the real world, these represent "unseen" cases to LLMs.
Our experiments show that while LLMs possess both components of requisite knowledge, they do not spontaneously combine them to handle these novel cases.
arXiv Detail & Related papers (2024-05-05T16:35:30Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
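A hedged sketch of what fine-grained control over problem features can look like with a purely symbolic template; the paper's actual neuro-symbolic pipeline is more elaborate, and the names and template below are illustrative assumptions.

```python
# Generate arithmetic word problems from a symbolic spec whose fields
# (operation direction, operand values) can be varied independently.
from dataclasses import dataclass

@dataclass
class ProblemSpec:
    transfer: bool   # True: "gave away" (subtraction); False: "received" (addition)
    start: int
    amount: int

def realize(spec: ProblemSpec) -> tuple[str, int]:
    verb = "gave away" if spec.transfer else "received"
    answer = spec.start - spec.amount if spec.transfer else spec.start + spec.amount
    text = (f"Sam had {spec.start} marbles and {verb} {spec.amount} marbles. "
            f"How many marbles does Sam have now?")
    return text, answer

question, answer = realize(ProblemSpec(transfer=True, start=12, amount=5))
print(question, "->", answer)
```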
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts and hints.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z)
- SMART: A Situation Model for Algebra Story Problems via Attributed Grammar [74.1315776256292]
We introduce the concept of a situation model, which originates from psychology studies, to represent the mental states of humans in problem-solving.
We show that the proposed model outperforms all previous neural solvers by a large margin while preserving much better interpretability.
arXiv Detail & Related papers (2020-12-27T21:03:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.