Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
- URL: http://arxiv.org/abs/2507.02778v1
- Date: Thu, 03 Jul 2025 16:41:30 GMT
- Title: Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
- Authors: Ken Tsui
- Abstract summary: Self-correction is an important capability for large language models (LLMs). While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot'. Testing 14 models, we find an average 64.5% blind spot rate. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot': they fail to correct identical errors in their own outputs. To study this phenomenon systematically, we introduce Self-Correction Bench, a framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.
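The abstract describes a simple measurement protocol: present the model with its own partial output containing an injected error, and check whether the continuation corrects it, optionally after appending a trigger token such as "Wait". The following is a minimal sketch of that idea; the `generate` interface, the correction-marker heuristic, and the item format are assumptions for illustration, not the released Self-Correction Bench.

```python
# Sketch of a blind-spot measurement, assuming a generic text-completion
# interface. The correction markers and prompt layout are illustrative
# assumptions, not the authors' benchmark specification.
from typing import Callable, Sequence

CORRECTION_MARKERS = ("wait", "actually", "i made a mistake", "correction")  # assumed cue words


def continues_with_correction(continuation: str) -> bool:
    """Heuristic check: does the continuation acknowledge the injected error?"""
    lowered = continuation.lower()
    return any(marker in lowered for marker in CORRECTION_MARKERS)


def blind_spot_rate(
    generate: Callable[[str], str],   # hypothetical LLM completion call
    items: Sequence[dict],            # each item: {"question": ..., "erroneous_prefix": ...}
    trigger: str = "",                # set to "Wait" to test the activation trick
) -> float:
    """Fraction of items where the model fails to correct an injected error
    that appears in its *own* partial output."""
    misses = 0
    for item in items:
        # The erroneous reasoning is presented as the model's own prior output,
        # optionally followed by the trigger token.
        suffix = f" {trigger}" if trigger else ""
        prompt = f"{item['question']}\n{item['erroneous_prefix']}{suffix}"
        if not continues_with_correction(generate(prompt)):
            misses += 1
    return misses / len(items)


if __name__ == "__main__":
    # Dummy usage: compare the rate with and without the "Wait" trigger.
    dummy_items = [{"question": "What is 17 + 25?", "erroneous_prefix": "17 + 25 = 32, so"}]

    def dummy_model(prompt: str) -> str:
        return "the answer is 32."  # placeholder; plug in a real LLM call here

    print(blind_spot_rate(dummy_model, dummy_items))
    print(blind_spot_rate(dummy_model, dummy_items, trigger="Wait"))
```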
Related papers
- Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models [11.379764847748378]
Large language models (LLMs) often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. We introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics.
arXiv Detail & Related papers (2025-05-29T17:49:44Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs [61.12688072239607]
This work formally defines self-consistent errors and evaluates mainstream detection methods on them. All four types of detection methods significantly struggle to detect self-consistent errors. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method.
arXiv Detail & Related papers (2025-05-23T09:18:56Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling [9.305763502526833]
We propose an accountability model for task-oriented dialogue agents to address user overreliance via friction turns. Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions.
arXiv Detail & Related papers (2025-01-17T17:40:12Z) - ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training [14.178223242134166]
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. LLMs are susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. We propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs.
arXiv Detail & Related papers (2024-10-15T15:52:45Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - Large Language Models have Intrinsic Self-Correction Ability [18.79203446847577]
Large language models (LLMs) have attracted significant attention for their exceptional abilities in various natural language processing tasks. One promising solution to improve the LLMs' performance is to ask LLMs to revise their answer after generation. Intrinsic self-correction is considered a promising direction because it does not utilize external knowledge.
arXiv Detail & Related papers (2024-06-21T22:29:40Z) - SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales [29.33581578047835]
SaySelf is a training framework that teaches large language models to express more accurate fine-grained confidence estimates.
In addition, SaySelf directs LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge.
We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration.
arXiv Detail & Related papers (2024-05-31T16:21:16Z) - Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs)
This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)