Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis
- URL: http://arxiv.org/abs/2601.00828v1
- Date: Wed, 24 Dec 2025 21:51:24 GMT
- Title: Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis
- Authors: Yin Li
- Abstract summary: We decompose self-correction into three sub-capabilities: error detection, error localization, and error correction. Our findings challenge linear assumptions about model capability and self-improvement.
- Score: 6.901585308625979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are widely believed to possess self-correction capabilities, yet recent studies suggest that intrinsic self-correction--where models correct their own outputs without external feedback--remains largely ineffective. In this work, we systematically decompose self-correction into three distinct sub-capabilities: error detection, error localization, and error correction. Through cross-model experiments on GSM8K-Complex (n=500 per model, 346 total errors) with three major LLMs, we uncover a striking Accuracy-Correction Paradox: weaker models (GPT-3.5, 66% accuracy) achieve 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy)--26.8% vs 16.7%. We propose the Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self-correction. Error detection rates vary dramatically across architectures (10% to 82%), yet detection capability does not predict correction success--Claude detects only 10% of errors but corrects 29% intrinsically. Surprisingly, providing error location hints hurts all models. Our findings challenge linear assumptions about model capability and self-improvement, with important implications for the design of self-refinement pipelines.
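To make the decomposition concrete, here is a minimal sketch (hypothetical record fields and toy counts, not the authors' evaluation code) of how the three sub-capability rates and the intrinsic correction rate could be computed from per-problem logs of an initial attempt followed by one self-revision pass:

```python
# Hypothetical sketch: compute error detection, localization, and intrinsic
# correction rates from per-problem evaluation records. The Record fields and
# the toy numbers below are assumptions, not the paper's actual pipeline.
from dataclasses import dataclass

@dataclass
class Record:
    initial_correct: bool   # first attempt matched the reference answer
    flagged_error: bool     # model claimed its own solution contains an error
    located_step: bool      # model pointed at the step that is actually wrong
    revised_correct: bool   # answer after one self-revision pass is correct

def self_correction_rates(records: list[Record]) -> dict[str, float]:
    errors = [r for r in records if not r.initial_correct]
    if not errors:
        return {"accuracy": 1.0}
    n = len(errors)
    return {
        "accuracy": sum(r.initial_correct for r in records) / len(records),
        "detection_rate": sum(r.flagged_error for r in errors) / n,
        "localization_rate": sum(r.located_step for r in errors) / n,
        # Intrinsic correction rate: initially wrong answers fixed without any
        # external feedback -- the quantity behind the 26.8% vs 16.7% gap.
        "correction_rate": sum(r.revised_correct for r in errors) / n,
    }

# Toy illustration of the Accuracy-Correction Paradox: the "weak" model makes
# more errors yet repairs a larger fraction of them than the "strong" model.
weak = 66 * [Record(True, False, False, False)] + \
       9 * [Record(False, True, True, True)] + 25 * [Record(False, False, False, False)]
strong = 94 * [Record(True, False, False, False)] + \
         1 * [Record(False, True, True, True)] + 5 * [Record(False, True, False, False)]
print(self_correction_rates(weak)["correction_rate"])    # 9/34, roughly 0.26
print(self_correction_rates(strong)["correction_rate"])  # 1/6, roughly 0.17
```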
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z) - Probing for Arithmetic Errors in Language Models [86.8227317662622]
Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy (a minimal probe sketch in this spirit appears after the related-papers list).
arXiv Detail & Related papers (2025-07-16T16:27:50Z) - Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models [0.7910367295422812]
Large language models (LLMs) make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources.
arXiv Detail & Related papers (2025-07-03T16:41:30Z) - Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
Self-correction is one approach to improving math reasoning. Existing self-correction approaches treat corrections as standalone post-generation refinements. We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z) - Sherlock: Self-Correcting Reasoning in Vision-Language Models [27.122890248991556]
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. They are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize. We introduce Sherlock, a self-correction and self-improvement training framework. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks.
arXiv Detail & Related papers (2025-05-28T17:58:03Z) - CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction [11.731590131260424]
CorBenchX is a suite for automated error detection and correction in chest X-ray reports. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports. We benchmark both open- and closed-source vision-language models.
arXiv Detail & Related papers (2025-05-17T15:39:39Z) - IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models [11.075423190298686]
Large language models (LLMs) are notoriously vulnerable to biases in their training data, leading to issues such as toxicity. In this paper, we introduce IRepair, a novel dynamic slicing-based, intent-aware LLM repair strategy. We show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance.
arXiv Detail & Related papers (2025-02-10T22:07:02Z) - Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens of reasoning steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) can self-correct on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z) - How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
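As referenced above, the following is a minimal sketch of the lightweight error-detector idea from "Probing for Arithmetic Errors in Language Models": a logistic-regression probe on hidden-state activations predicting whether the model's answer is correct. The feature extraction, layer choice, and synthetic data are assumptions for illustration, not that paper's implementation.

```python
# Sketch of a lightweight correctness probe over hidden states (assumed setup,
# not the paper's code): a logistic-regression classifier maps an activation
# vector taken at the answer position to "model answer correct / incorrect".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_correctness_probe(hidden_states: np.ndarray, is_correct: np.ndarray) -> LogisticRegression:
    # hidden_states: (n_examples, hidden_dim) activations at the final answer token
    # is_correct:    (n_examples,) binary labels from an exact-match answer checker
    X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, is_correct,
                                              test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    print(f"held-out probe accuracy: {probe.score(X_te, y_te):.1%}")
    return probe

# Synthetic stand-in for real activations so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 256))
y = (X[:, :8].sum(axis=1) + 0.3 * rng.normal(size=2000) > 0).astype(int)
train_correctness_probe(X, y)
```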