VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
- URL: http://arxiv.org/abs/2412.02172v1
- Date: Tue, 03 Dec 2024 05:04:49 GMT
- Title: VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
- Authors: Xueqing Wu, Yuheng Ding, Bingxuan Li, Pan Lu, Da Yin, Kai-Wei Chang, Nanyun Peng
- Abstract summary: We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs.
VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought.
LookBack significantly improves critique and correction performance by up to 13.5%.
- Score: 112.35483894933904
- License:
- Abstract: The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance performance after correction, showcasing the potential of the self-improvement strategy. However, model-generated critiques are less helpful and sometimes detrimental to performance, suggesting that critique is the crucial bottleneck. We identify three common patterns in critique failures: failure to critique visual perception, reluctance to "say no", and exaggerated assumptions of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.
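The abstract describes a two-stage pipeline: a per-step critique of the chain-of-thought, followed by a correction pass, with the LookBack strategy re-checking each claim against the image. The sketch below is only an illustrative reading of that pipeline under stated assumptions, not the paper's implementation; the LVLM interface, prompt wording, and helper names are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed LVLM interface: takes (image path, text prompt) and returns text.
# This is a placeholder, not the API used in the VISCO paper.
LVLM = Callable[[str, str], str]


@dataclass
class StepCritique:
    step: str
    is_correct: bool
    explanation: str


def critique_with_lookback(model: LVLM, image: str, question: str,
                           cot_steps: List[str]) -> List[StepCritique]:
    """Critique each chain-of-thought step, re-checking the image for every claim
    (a rough sketch of the LookBack idea described in the abstract)."""
    critiques = []
    for step in cot_steps:
        # LookBack: verify this single step directly against the image.
        verdict = model(image, f"Question: {question}\n"
                               f"Claim to verify against the image: {step}\n"
                               f"Answer 'correct' or 'incorrect', then explain.")
        is_correct = verdict.lower().startswith("correct")
        critiques.append(StepCritique(step, is_correct, verdict))
    return critiques


def correct(model: LVLM, image: str, question: str,
            cot_steps: List[str], critiques: List[StepCritique]) -> str:
    """Regenerate an answer conditioned on the fine-grained, per-step critique."""
    feedback = "\n".join(
        f"Step {i + 1}: {'OK' if c.is_correct else 'ERROR'} - {c.explanation}"
        for i, c in enumerate(critiques))
    return model(image, f"Question: {question}\n"
                        "Previous reasoning:\n" + "\n".join(cot_steps) +
                        f"\nCritique of each step:\n{feedback}\n"
                        "Give a corrected answer.")
```

Any wrapper that maps an (image, prompt) pair to text can be plugged in as `model`; the benchmark itself scores both the per-step critique labels and the corrected answer.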
Related papers
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs).
Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
- Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities.
It self-improves by training on synthetic data, generated by a contrastive-based self-critic.
It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning [46.411313304605564]
Critic-V is a framework inspired by the Actor-Critic paradigm to boost the reasoning capability of vision-language models (VLMs)
The Reasoner generates reasoning paths based on visual and textual inputs, and the Critic provides constructive critique to refine these paths.
Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks.
arXiv Detail & Related papers (2024-11-27T10:28:57Z)
- Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
It combines a CoT reasoning paradigm with the automatic construction of distant-supervision data without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
- Learning to Refine with Fine-Grained Natural Language Feedback [81.70313509881315]
We propose looking at refinement with feedback as a composition of three distinct LLM competencies.
A key property of the proposed Detect, Critique, Refine ("DCR") method is that the step 2 critique model can give fine-grained feedback about errors.
We show that models of different capabilities benefit from refining with DCR on the task of improving factual consistency of document-grounded summaries.
arXiv Detail & Related papers (2024-07-02T16:15:01Z)
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z)
- CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEval is a novel benchmark designed to comprehensively and reliably evaluate the critique ability of Large Language Models.
To ensure comprehensiveness, CriticEval evaluates critique ability across four dimensions and nine diverse task scenarios.
To ensure reliability, a large number of critiques are annotated to serve as references.
arXiv Detail & Related papers (2024-02-21T12:38:59Z)