History-Guided Iterative Visual Reasoning with Self-Correction
- URL: http://arxiv.org/abs/2602.04413v1
- Date: Wed, 04 Feb 2026 10:42:06 GMT
- Title: History-Guided Iterative Visual Reasoning with Self-Correction
- Authors: Xinglong Yang, Zhilin Peng, Zhanzhan Liu, Haochen Shi, Sheng-Jun Huang,
- Abstract summary: Self-consistency methods are the core technique for improving the reasoning reliability of large language models (MLLMs)<n>We propose the H-GIVR framework, inspired by the human reasoning behavior of repeated verification and dynamic error correction.<n>The framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost.
- Score: 29.952971071458574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed ``repeated sampling and voting'' paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using \texttt{Llama3.2-vision:11b} on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90\%, representing a 107\% improvement over the baseline.
Related papers
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models [29.830224745428566]
We present MMErroR, a benchmark of 2,013 samples each embedding a single coherent reasoning error.<n>Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation.<n>We evaluate 20 advanced Vision-Language Models, even the best model (Gemini-3.0-Pro) classifies the error in only 66.47% of cases.
arXiv Detail & Related papers (2026-01-06T17:45:26Z) - ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models [23.37951284612929]
We construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains.<n>In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding.<n>In addition, we propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction.
arXiv Detail & Related papers (2025-12-01T09:05:02Z) - Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [70.35701681177655]
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models.<n>We introduce four efficient strategies to achieve head-tail re-balancing during the exploration-and-learning self-improvement process.<n>Our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
arXiv Detail & Related papers (2025-10-30T13:26:58Z) - Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance.<n>We identify five distinct failure modes when models reason for longer.<n>These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z) - Representation Consistency for Accurate and Coherent LLM Answer Aggregation [31.694036998078264]
representation consistency (RC) is a test-time scaling method for aggregating answers drawn from multiple candidate responses of an large language model.<n>RC enhances answer aggregation by considering the number of occurrences of each answer in the candidate response set.<n>Our method only uses cached activations and lightweight similarity computations and requires no additional model queries.
arXiv Detail & Related papers (2025-06-18T05:07:47Z) - The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models [54.88805865447848]
We show that instruct models achieve higher efficiency overall, and problem difficulty affects efficiency.<n>We propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it.<n>On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
arXiv Detail & Related papers (2025-05-28T06:24:45Z) - LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling [39.61818305829112]
We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step.<n>PIR identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components.<n>Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets.
arXiv Detail & Related papers (2025-05-25T15:17:57Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.<n>Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.<n>Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information.<n>We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Vision-Language Models Can Self-Improve Reasoning via Reflection [20.196406628954303]
Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs)
We propose a self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales.
Our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
arXiv Detail & Related papers (2024-10-30T14:45:00Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.