Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
- URL: http://arxiv.org/abs/2412.16838v1
- Date: Sun, 22 Dec 2024 03:08:36 GMT
- Title: Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
- Authors: Hang Li, Tianlong Xu, Kaiqi Yang, Yucheng Chu, Yanling Chen, Yichi Song, Qingsong Wen, Hui Liu
- Abstract summary: We introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using large language models (LLMs) to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance.
- Score: 16.815772962323628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.
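A minimal sketch of how an ask-before-detect pipeline could be wired together, assuming a generic `call_llm` helper (a hypothetical placeholder, not part of the paper's code); the two-stage structure (first generate adaptive reference solutions, then detect errors against them with chain-of-thought style instructions) follows the abstract, while all prompts and function names are illustrative.

```python
# Hypothetical sketch of an Ask-Before-Detect style pipeline.
# `call_llm` is a placeholder for any chat-completion client; it is not
# part of the AskBD release and must be supplied by the reader.

def call_llm(prompt: str) -> str:
    """Stub: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("plug in your own LLM client here")

def generate_reference_solutions(problem: str, n: int = 3) -> list[str]:
    """Stage 1: ask the LLM for several valid solution paths up front,
    so alternative (non-conventional) solutions are represented."""
    prompt = (
        f"List {n} distinct, fully worked solutions to this math word problem. "
        f"Separate them with blank lines.\n\nProblem: {problem}"
    )
    return call_llm(prompt).split("\n\n")

def detect_errors(problem: str, student_solution: str, references: list[str]) -> str:
    """Stage 2: grade the student's work against the adaptive references."""
    refs = "\n\n".join(references)
    prompt = (
        "You are checking a student's solution to a math word problem.\n"
        f"Problem: {problem}\n\nReference solutions:\n{refs}\n\n"
        f"Student solution:\n{student_solution}\n\n"
        "Think step by step, then state whether the student's solution is "
        "correct and, if not, identify the first erroneous step."
    )
    return call_llm(prompt)
```

The point of stage 1 is that the detector is no longer anchored to a single canonical solution, which is the source of the conformity bias described above.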
Related papers
- LLMs cannot spot math errors, even when allowed to peek into the solution [17.91547969168414]
We investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. We propose an approach that generates an intermediate corrected student solution, aligning more closely with the original student's solution.
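A rough sketch of that intermediate-correction idea, assuming a hypothetical `call_llm` helper; the prompt and the step-diffing heuristic are illustrative, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): generate a corrected version
# of the student's own solution, then report the first step where the two
# versions diverge as the candidate first error.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def first_error_step(problem: str, student_steps: list[str]) -> int | None:
    prompt = (
        "Rewrite the student's solution so that it is correct, changing as "
        "few steps as possible and keeping the student's overall approach.\n"
        f"Problem: {problem}\nStudent steps:\n" + "\n".join(student_steps)
    )
    corrected_steps = [s for s in call_llm(prompt).splitlines() if s.strip()]
    # The first position where the corrected solution departs from the
    # student's own steps is taken as the first error.
    for i, (orig, fixed) in enumerate(zip(student_steps, corrected_steps)):
        if orig.strip() != fixed.strip():
            return i
    return None  # no divergence found -> solution looks correct
```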
arXiv Detail & Related papers (2025-09-01T11:41:10Z)
- Error Detection and Correction for Interpretable Mathematics in Large Language Models [5.258949636570995]
EDCIM (Error Detection and Correction for Interpretable Mathematics) is a method for detecting and correcting these errors in interpretable mathematics tasks. It integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. Experimental results show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy.
arXiv Detail & Related papers (2025-08-05T14:30:35Z)
- Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs [61.12688072239607]
This work formally defines self-consistent errors and evaluates mainstream detection methods on them. All four types of detection methods significantly struggle to detect self-consistent errors. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method.
arXiv Detail & Related papers (2025-05-23T09:18:56Z)
- Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review [11.856357456956351]
Large Language Models (LLMs) have been transformative across many domains.
Researchers have applied Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy.
This survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
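Since the survey concerns the gap between a model's stated confidence and its accuracy, the standard expected calibration error (ECE) is a concrete example of the kind of metric it reviews; the sketch below is generic and not tied to any specific method from the survey.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: bin predictions by confidence and average the gap
    between mean confidence and empirical accuracy in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: overconfident predictions yield a nonzero ECE.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [1, 0, 1, 0]))
```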
arXiv Detail & Related papers (2025-04-25T13:34:40Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.
We show that the widely used beam search method suffers from unacceptable over-optimism.
We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.
Current error classification methods rely on static and predefined categories.
We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
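A toy illustration of how RISE-style preference pairs could be constructed: take a correct solution, inject one subtle error into part of it, and keep the (correct, corrupted) pair for preference learning. The specific perturbations below are examples chosen for the sketch, not the paper's exact error set.

```python
# Illustrative construction of error-injected preference pairs.
import random
import re

def inject_subtle_error(step: str) -> str:
    """Flip one digit, or one operator as a fallback, in a solution step."""
    digits = [m.start() for m in re.finditer(r"\d", step)]
    if digits:
        i = random.choice(digits)
        wrong = str((int(step[i]) + random.randint(1, 8)) % 10)
        return step[:i] + wrong + step[i + 1:]
    return step.replace("+", "-", 1)

def make_preference_pair(correct_steps: list[str]) -> dict:
    """Corrupt one randomly chosen step to form a chosen/rejected pair."""
    k = random.randrange(len(correct_steps))
    corrupted = correct_steps.copy()
    corrupted[k] = inject_subtle_error(corrupted[k])
    return {"chosen": correct_steps, "rejected": corrupted, "error_step": k}

pair = make_preference_pair(["3 * 4 = 12", "12 + 5 = 17", "Answer: 17"])
print(pair)
```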
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
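A rough sketch of the CoT/PoT cross-checking idea: produce a natural-language chain-of-thought answer and a short program for the same problem, run the program, and only trust answers the two routes agree on. `call_llm` is a hypothetical placeholder and the prompts are illustrative, not the paper's implementation.

```python
# Cross-checking a chain-of-thought answer against a program-of-thought answer.
# NOTE: executing model-written code requires sandboxing in any real system.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def cot_answer(problem: str) -> str:
    out = call_llm(f"Solve step by step, end with 'Answer: <number>'.\n{problem}")
    return out.rsplit("Answer:", 1)[-1].strip()

def pot_answer(problem: str) -> str:
    code = call_llm(
        "Write Python that computes the answer to the problem below and "
        f"stores it in a variable named `answer`.\n{problem}"
    )
    scope: dict = {}
    exec(code, scope)            # sandbox this in practice
    return str(scope.get("answer"))

def verified_answer(problem: str) -> str | None:
    a, b = cot_answer(problem), pot_answer(problem)
    return a if a == b else None  # disagreement -> abstain or re-sample
```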
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning [11.63133816413199]
Large Language Models (LLMs) have been applied to Math Word Problems (MWPs).
We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models.
We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models.
arXiv Detail & Related papers (2024-06-16T08:06:05Z)
- Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions [48.251724997889184]
We develop a benchmark called Problems with Missing and Contradictory conditions (PMC).
We introduce two novel metrics to evaluate the performance of few-shot prompting methods in these scenarios.
We propose a novel few-shot prompting method called SMT-LIB Prompting (SLP), which utilizes the SMT-LIB language to model the problems instead of solving them directly.
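A toy illustration of the modelling idea behind SMT-LIB Prompting, using the z3 Python API: the problem's conditions are encoded as constraints rather than solved directly, so contradictory conditions surface as `unsat`. In SLP the constraint set would be produced by the LLM in SMT-LIB; here it is written by hand for a made-up example.

```python
from z3 import Int, Solver, sat

x = Int("apples")
s = Solver()
s.add(x >= 0)
s.add(x + 3 == 10)   # "after buying 3 apples there are 10"
s.add(x == 9)        # a contradictory extra condition

if s.check() == sat:
    print("conditions are consistent, e.g.", s.model())
else:
    print("the stated conditions contradict each other")
```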
arXiv Detail & Related papers (2024-06-07T16:24:12Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment [32.12998469814097]
A novel causal prompting method based on front-door adjustment is proposed to effectively mitigate the biases of Large Language Models (LLMs). Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets.
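One way to picture the front-door idea at inference time is to treat sampled reasoning chains as the mediator: sample several chains for the query, re-derive an answer from each chain, and aggregate. The sketch below is a loose approximation of that two-stage sampling with a hypothetical `call_llm` placeholder; it is not the authors' estimator.

```python
# Loose, illustrative two-stage sampling: stage 1 samples reasoning chains
# (the mediator), stage 2 derives an answer from each chain, and the final
# prediction is a majority vote over the derived answers.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def causal_prompting(question: str, k: int = 5) -> str:
    answers = []
    for _ in range(k):
        chain = call_llm(f"Reason step by step about: {question}")
        answer = call_llm(
            "Given only the reasoning below, state the final answer.\n" + chain
        )
        answers.append(answer.strip())
    return Counter(answers).most_common(1)[0][0]
```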
arXiv Detail & Related papers (2024-03-05T07:47:34Z)
- Mitigating Biases of Large Language Models in Stance Detection with Counterfactual Augmented Calibration [43.02857908228108]
Large language models (LLMs) have demonstrated significant advancements across various natural language processing tasks including stance detection.
Their performance in stance detection is limited by biases and spurious correlations inherent due to their data-driven nature.
We propose a Counterfactual Augmented Network (FACTUAL), in which a novel calibration network is devised to calibrate potential bias in the stance prediction of LLMs.
arXiv Detail & Related papers (2024-02-22T05:17:49Z)
- Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis [10.06218778776515]
We introduce a systematic approach to test the robustness of large language models (LLMs) in multi-hop reasoning tasks via domain-agnostic perturbations.
We find that models are more sensitive to certain perturbations such as replacing words with their synonyms.
We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.
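A toy illustration of one such domain-agnostic perturbation, swapping a fraction of exemplar words for synonyms before building the prompt; the synonym table and perturbation rate below are illustrative, not the paper's configuration.

```python
import random

# Illustrative synonym table; a real study would use a broader lexical resource.
SYNONYMS = {"buy": "purchase", "cost": "price", "total": "sum", "gets": "receives"}

def perturb(text: str, rate: float = 0.3) -> str:
    """Replace some known words with synonyms at the given rate."""
    words = text.split()
    for i, w in enumerate(words):
        key = w.lower().strip(".,")
        if key in SYNONYMS and random.random() < rate:
            words[i] = SYNONYMS[key]
    return " ".join(words)

exemplar = "Sam wants to buy 3 pens. Each pen has a cost of 2 dollars. The total is 6."
print(perturb(exemplar))
```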
arXiv Detail & Related papers (2023-11-01T03:15:05Z)
- Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models [62.96551299003463]
We propose Thought Propagation (TP) to enhance the complex reasoning ability of Large Language Models.
TP first prompts LLMs to propose and solve a set of analogous problems that are related to the input one.
TP reuses the results of analogous problems to directly yield a new solution or derive a knowledge-intensive plan for execution to amend the initial solution obtained from scratch.
arXiv Detail & Related papers (2023-10-06T01:40:09Z)
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification schema for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
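A simplified sketch in the SelfCheck spirit: each step is re-checked conditioned on the steps before it, and the per-step verdicts are folded into a single confidence score. `call_llm` is a hypothetical placeholder and the aggregation is deliberately naive compared with the paper's scheme.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def check_step(problem: str, prior_steps: list[str], step: str) -> bool:
    """Ask the model whether the next step follows from the steps so far."""
    verdict = call_llm(
        f"Problem: {problem}\nSteps so far:\n" + "\n".join(prior_steps) +
        f"\nIs the next step correct? Answer yes or no.\nNext step: {step}"
    )
    return verdict.strip().lower().startswith("yes")

def solution_confidence(problem: str, steps: list[str]) -> float:
    """Fraction of steps that pass the zero-shot check."""
    verdicts = [check_step(problem, steps[:i], s) for i, s in enumerate(steps)]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```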
arXiv Detail & Related papers (2023-08-01T10:31:36Z)