Error Detection and Correction for Interpretable Mathematics in Large Language Models
- URL: http://arxiv.org/abs/2508.03500v1
- Date: Tue, 05 Aug 2025 14:30:35 GMT
- Title: Error Detection and Correction for Interpretable Mathematics in Large Language Models
- Authors: Yijin Yang, Cristina Cornelio, Mario Leiva, Paulo Shakarian
- Abstract summary: EDCIM (Error Detection and Correction for Interpretable Mathematics) is a method for detecting and correcting errors in the intermediate reasoning steps of LLMs on interpretable mathematics tasks. It integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. Experimental results show that EDCIM significantly reduces both computational and financial costs while maintaining, and even improving, prediction accuracy.
- Score: 5.258949636570995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) have demonstrated the ability to perform explicit multi-step reasoning, for example through chain-of-thought prompting. However, their intermediate steps often contain errors that can propagate, leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solves the problem (expressed in natural language) rather than a black-box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error-detection framework that identifies errors and provides targeted feedback for LLM-based correction. To optimize efficiency, EDCIM integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to set the trade-off according to their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs while maintaining, and even improving, prediction accuracy when the balance is properly configured.
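The generate, detect, correct, and escalate loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the symbolic checks or name the hyperparameter, so `detect_errors` (a simple syntax and undeclared-variable check), `max_cheap_rounds` (a stand-in for the single cost/accuracy hyperparameter), and the two stub models are all assumptions made for illustration.

```python
import ast
from typing import Callable, List, Set, Tuple

Equations = List[str]
# Hypothetical model interface: (problem text, feedback messages) -> equations.
Model = Callable[[str, List[str]], Equations]


def detect_errors(equations: Equations, variables: Set[str]) -> List[str]:
    """Toy symbolic check: each equation must have one '=', parse as
    Python arithmetic on both sides, and use only declared variables."""
    errors = []
    for eq in equations:
        sides = eq.split("=")
        if len(sides) != 2:
            errors.append(f"'{eq}': expected exactly one '='")
            continue
        for side in sides:
            try:
                tree = ast.parse(side.strip(), mode="eval")
            except SyntaxError:
                errors.append(f"'{eq}': cannot parse '{side.strip()}'")
                continue
            names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
            for name in sorted(names - variables):
                errors.append(f"'{eq}': unknown variable '{name}'")
    return errors


def solve_with_cascade(problem: str, variables: Set[str], cheap: Model,
                       strong: Model, max_cheap_rounds: int) -> Tuple[Equations, str]:
    """Let the cheap model try (with targeted feedback) up to
    max_cheap_rounds times, then escalate to the strong model."""
    feedback: List[str] = []
    for _ in range(max_cheap_rounds):
        candidate = cheap(problem, feedback)
        feedback = detect_errors(candidate, variables)
        if not feedback:
            return candidate, "cheap"
    return strong(problem, feedback), "strong"


# Stub models standing in for real LLM calls (assumptions for the demo).
def cheap_stub(problem: str, feedback: List[str]) -> Equations:
    # The first attempt uses an undeclared variable; feedback fixes it.
    if feedback:
        return ["x + y = 10", "x - y = 2"]
    return ["x + z = 10", "x - y = 2"]


def strong_stub(problem: str, feedback: List[str]) -> Equations:
    return ["x + y = 10", "x - y = 2"]


if __name__ == "__main__":
    problem = "Two numbers sum to 10 and differ by 2."
    print(solve_with_cascade(problem, {"x", "y"}, cheap_stub, strong_stub, 2))
    print(solve_with_cascade(problem, {"x", "y"}, cheap_stub, strong_stub, 1))
```

In this sketch, a larger `max_cheap_rounds` keeps more of the work on the inexpensive model, while a value of 0 sends every problem straight to the stronger one. The actual trade-off mechanism in EDCIM may differ.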
Related papers
- EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning [66.82956219777763]
Large Language Models (LLMs) have demonstrated strong reasoning capabilities. The Error-IndUced LEaRning (EULER) model aims to develop an error exposure model that generates high-quality solution errors.
arXiv Detail & Related papers (2025-05-28T08:57:03Z)
- LEMMA: Learning from Errors for MatheMatical Advancement in LLMs [33.571479131705075]
We introduce Learning from Errors for Mathematical Advancement (LEMMA) to enhance large language models' reasoning ability. LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.
arXiv Detail & Related papers (2025-03-21T17:59:10Z)
- The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It [23.803612556616685]
We present a mechanistic analysis of error detection in large language models (LLMs). Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads: attention heads that assess surface-level alignment of numerical values in arithmetic solutions.
arXiv Detail & Related papers (2025-02-17T13:00:44Z)
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities. Current error classification methods rely on static and predefined categories. We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) have recently exhibited remarkable reasoning capabilities in solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.