LLMs cannot spot math errors, even when allowed to peek into the solution
- URL: http://arxiv.org/abs/2509.01395v1
- Date: Mon, 01 Sep 2025 11:41:10 GMT
- Title: LLMs cannot spot math errors, even when allowed to peek into the solution
- Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
- Abstract summary: We investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. We propose an approach that generates an intermediate corrected student solution, aligning more closely with the original student's solution.
- Score: 17.91547969168414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To address this, we propose an approach that generates an intermediate corrected student solution, which aligns more closely with the original student's solution and helps improve performance.
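The core idea of the proposed approach can be illustrated with a small sketch. The details below are hypothetical (the paper's actual prompting pipeline and alignment procedure are not reproduced here): instead of comparing a student solution against a possibly very different reference solution, one first produces a corrected version of the student's own solution and then locates the first step where the two diverge.

```python
# Hypothetical sketch, not the authors' code: locate the first error step by
# aligning the student's solution against a *corrected* version of that same
# solution, rather than against an independently written reference solution.

def first_error_step(student_steps, corrected_steps):
    """Return the 1-based index of the first student step that differs
    from the corrected solution, or None if no divergence is found."""
    for i, (s, c) in enumerate(zip(student_steps, corrected_steps), start=1):
        if s.strip() != c.strip():
            return i
    # The student may have stopped short of the corrected solution.
    if len(student_steps) < len(corrected_steps):
        return len(student_steps) + 1
    return None

student = [
    "Let x be the number of apples.",
    "Then 3x + 2 = 11.",
    "So x = 4.",          # arithmetic slip: should be x = 3
    "The answer is 4.",
]
corrected = [
    "Let x be the number of apples.",
    "Then 3x + 2 = 11.",
    "So x = 3.",
    "The answer is 3.",
]
print(first_error_step(student, corrected))  # -> 3
```

In the paper, an LLM would generate the corrected solution and the comparison itself would also involve the model; the exact string comparison above is only a stand-in for that step.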
Related papers
- Solving Math Word Problems Using Estimation Verification and Equation Generation [10.770851135821657]
Large Language Models (LLMs) excel at various tasks, including problem-solving and question-answering. Recent efforts have helped LLMs solve more complex Math Word Problems with improved prompts. This study proposes a novel method that initially prompts an LLM to create equations from a decomposition of the question, followed by using an external symbolic equation solver to produce an answer.
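The generate-equations-then-solve pipeline described above can be sketched as follows. This is a hedged illustration: the LLM step is mocked as a fixed pair of equations, and a tiny Cramer's-rule solver for 2x2 linear systems stands in for a real symbolic solver (such as SymPy) so the example stays dependency-free.

```python
# Hedged sketch of an equation-generation + external-solver pipeline.
# The "LLM output" is hard-coded; a real symbolic solver would replace
# the minimal 2x2 solver below.

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 via Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("system is singular or underdetermined")
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y

# Word problem: "Alice has twice as many marbles as Bob; together they
# have 18." A capable LLM is assumed to emit: x - 2*y = 0, x + y = 18.
x, y = solve_2x2(1, -2, 0, 1, 1, 18)
print(x, y)  # -> 12.0 6.0
```

The design point is the division of labor: the LLM handles natural-language understanding (translating the problem into equations), while exact arithmetic is delegated to a solver that cannot make calculation errors.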
arXiv Detail & Related papers (2025-09-23T02:41:39Z)
- Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving [3.2233767737586674]
Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems. This paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench.
arXiv Detail & Related papers (2025-05-27T08:21:07Z)
- A Knapsack by Any Other Name: Presentation impacts LLM performance on NP-hard problems [64.05451567422342]
We introduce the dataset of Everyday Hard Optimization Problems (EHOP), a collection of NP-hard problems expressed in natural language. EHOP includes problem formulations that could be found in computer science textbooks (e.g., graph coloring), as well as versions dressed up as problems that could arise in real life. We find that state-of-the-art LLMs, across multiple prompting strategies, solve textbook problems more accurately than their real-life and inverted counterparts.
arXiv Detail & Related papers (2025-02-19T14:39:59Z)
- Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions [16.815772962323628]
We introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using large language models (LLMs) to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance.
arXiv Detail & Related papers (2024-12-22T03:08:36Z)
- Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems presented together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
LLMs struggle to precisely detect students' errors and tailor their feedback to these errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) recently exhibited remarkable reasoning capabilities on solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
- Improving Large Language Model Fine-tuning for Solving Math Problems [20.417053742869403]
A large gap exists between large language models' pass-at-one and pass-at-N performance in solving math problems.
Using the challenging MATH dataset, we investigate three fine-tuning strategies.
We design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models.
arXiv Detail & Related papers (2023-10-16T04:11:19Z)
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification scheme for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
arXiv Detail & Related papers (2023-08-01T10:31:36Z)
- Learning by Fixing: Solving Math Word Problems with Weak Supervision [70.62896781438694]
Previous neural solvers of math word problems (MWPs) are learned with full supervision and fail to generate diverse solutions.
We introduce a weakly-supervised paradigm for learning MWPs.
Our method only requires the annotations of the final answers and can generate various solutions for a single problem.
arXiv Detail & Related papers (2020-12-19T03:10:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.