Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
- URL: http://arxiv.org/abs/2506.24006v1
- Date: Mon, 30 Jun 2025 16:10:42 GMT
- Title: Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
- Authors: Anselm R. Strohmaier, Wim Van Dooren, Kathrin Seßler, Brian Greer, Lieven Verschaffel
- Abstract summary: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. But their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear.
- Score: 0.6990493129893112
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.
Related papers
- CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective [68.94793547575343]
CogMath formalizes the human reasoning process into 3 stages: *problem comprehension*, *problem solving*, and *solution summarization*. In each dimension, an "Inquiry-Judge-Reference" multi-agent system generates inquiries that assess LLMs' mastery of that dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions.
arXiv Detail & Related papers (2025-06-04T22:00:52Z) - MathConstruct: Challenging LLM Reasoning with Constructive Proofs [0.9320657506524149]
MathConstruct is a new benchmark of 126 challenging problems sourced from various math competitions. MathConstruct is suitable for evaluating Large Language Models, as solution correctness can be easily verified.
arXiv Detail & Related papers (2025-02-14T14:44:22Z) - Give me a hint: Can LLMs take a hint to solve math problems? [0.5742190785269342]
We propose giving "hints" to improve the language model's performance on advanced mathematical problems.
We also test robustness to adversarial hints and demonstrate the models' sensitivity to them.
arXiv Detail & Related papers (2024-10-08T11:09:31Z) - Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems presented together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z) - Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula [25.549869705051606]
We investigate whether language models' (LMs) mathematical abilities extend to discerning the skills and concepts enabled by math content.
We develop two tasks for evaluating LMs' abilities to assess math problems.
We find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways.
arXiv Detail & Related papers (2024-08-08T05:28:34Z) - MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z) - FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models [44.63505885248145]
FineMath is a fine-grained mathematical evaluation benchmark dataset for assessing Chinese Large Language Models (LLMs).
FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are divided into 17 categories of math word problems.
All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems.
arXiv Detail & Related papers (2024-03-12T15:32:39Z) - Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem [58.3723958800254]
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks.
They are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon called hallucination.
This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on the unanswerable math word problem (MWP).
arXiv Detail & Related papers (2024-03-06T09:06:34Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z) - Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning [4.376598435975689]
We discuss the challenges associated with employing large language models to enhance students' mathematical problem-solving skills.
LLMs can generate incorrect reasoning processes, and they also exhibit difficulty in understanding a given question's rationale when attempting to correct students' answers.
arXiv Detail & Related papers (2023-10-20T16:05:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.