Sherlock: Self-Correcting Reasoning in Vision-Language Models
- URL: http://arxiv.org/abs/2505.22651v2
- Date: Thu, 23 Oct 2025 04:45:46 GMT
- Title: Sherlock: Self-Correcting Reasoning in Vision-Language Models
- Authors: Yi Ding, Ruqi Zhang
- Abstract summary: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize. We introduce Sherlock, a self-correction and self-improvement training framework. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks.
- Score: 27.122890248991556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
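To make the dynamic $\beta$ idea concrete, here is a minimal sketch of a DPO-style preference loss in which $\beta$ varies per sample. The abstract does not specify Sherlock's actual schedule, so the function name, the `sample_weight` modulation, and the use of sequence-level log-probabilities are all assumptions.

```python
import torch.nn.functional as F

def preference_loss_dynamic_beta(policy_logp_chosen, policy_logp_rejected,
                                 ref_logp_chosen, ref_logp_rejected,
                                 beta_base=0.1, sample_weight=None):
    """DPO-style preference loss with a per-sample (dynamic) beta.

    All inputs are sequence-level log-probability tensors of shape (batch,).
    `sample_weight` is a hypothetical per-sample modulation of beta; the
    paper's actual dynamic-beta rule is not given in the abstract.
    """
    # Log-ratios of the policy vs. the frozen reference model
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Dynamic beta: scale the base value per sample instead of fixing it globally
    beta = beta_base if sample_weight is None else beta_base * sample_weight
    # Standard Bradley-Terry preference objective
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Under the paper's data-construction scheme, the chosen trajectory would come from the clean image and the rejected one from its visually perturbed counterpart; that pairing is implied here only by the argument names.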
Related papers
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models [29.830224745428566]
We present MMErroR, a benchmark of 2,013 samples, each embedding a single coherent reasoning error. Unlike existing benchmarks that focus on answer correctness, MMErroR targets process-level, error-centric evaluation. We evaluate 20 advanced Vision-Language Models; even the best (Gemini-3.0-Pro) correctly classifies the error in only 66.47% of cases.
arXiv Detail & Related papers (2026-01-06T17:45:26Z) - Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis [6.901585308625979]
We decompose self-correction into three sub-capabilities: error detection, error localization, and error correction. Our findings challenge linear assumptions about model capability and self-improvement.
arXiv Detail & Related papers (2025-12-24T21:51:24Z) - Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [70.35701681177655]
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models. We introduce four efficient strategies to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
arXiv Detail & Related papers (2025-10-30T13:26:58Z) - Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs [0.0]
Self-correction is an important capability for large language models (LLMs). While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot' when reviewing their own output. Testing 14 models, we find an average blind-spot rate of 64.5%. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation.
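The "Wait" intervention lends itself to a one-line illustration. The sketch below is a hypothetical wrapper around an arbitrary text-generation callable; `generate` is a placeholder for any completion API, not a specific library.

```python
def answer_with_self_check(generate, prompt):
    """Hypothetical illustration of the 'Wait' trick described above.

    `generate` is any callable mapping a prompt string to a completion
    string; it stands in for a chat/completion API, not a real library.
    """
    draft = generate(prompt)
    # Re-feed the model's own draft with "Wait" appended, nudging it to
    # re-examine its output instead of exhibiting the blind spot.
    revision = generate(prompt + "\n" + draft + "\nWait,")
    return revision
```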
arXiv Detail & Related papers (2025-07-03T16:41:30Z) - Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One approach to improving math reasoning is self-correction. Existing self-correction approaches treat corrections as standalone post-generation refinements. We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents [7.392058124132526]
Foundation models (FMs) play an increasingly prominent role in complex software systems, such as agentic software. Fast-thinking Large Language Models (LLMs) are still preferred due to latency constraints. We introduce Watson, a framework that provides reasoning observability into implicit reasoning processes.
arXiv Detail & Related papers (2024-11-05T19:13:22Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
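A SCoRe-style setup implies a two-turn, self-generated data-collection loop; the minimal sketch below illustrates that loop. The correction instruction and the `generate` interface are placeholders, not the paper's actual prompts or API.

```python
CORRECTION_INSTRUCTION = (
    "There may be an error in the solution above. "
    "Re-examine it and provide a corrected solution."
)

def two_turn_rollout(generate, question):
    """Collect a (first attempt, revised attempt) pair from the model itself.

    Both turns are self-generated; in a SCoRe-style pipeline, a reward on
    the final answer would drive the multi-turn RL update.
    """
    first = generate(question)
    revised = generate(f"{question}\n{first}\n{CORRECTION_INSTRUCTION}")
    return first, revised
```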
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small ($\leq$ 13B) language models (LMs) can self-correct on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z) - Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models [38.79074982172423]
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
We propose modeling factual queries as constraint satisfaction problems.
We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations.
arXiv Detail & Related papers (2023-09-26T17:48:55Z)