Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
- URL: http://arxiv.org/abs/2603.02556v1
- Date: Tue, 03 Mar 2026 03:18:31 GMT
- Title: Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
- Authors: Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye
- Abstract summary: We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
- Score: 60.93949629734977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.
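The abstract outlines a pipeline: curate contrastive VQA pairs by multi-modal similarity, prompt the model with each pair to elicit rationales, and keep the resulting rationales for supervised finetuning. Below is a minimal Python sketch of such a loop, for illustration only: `embed_image`, `embed_text`, and `vlm_generate_rationale` are hypothetical stand-ins for a multi-modal encoder and a VLM, and the averaged similarity metric, greedy pairing with a fixed threshold, and exact-match answer filter are assumptions rather than details taken from the paper.

```python
# Sketch of a STaR-style contrastive rationale-collection loop, loosely
# following the pipeline described in the abstract. The helper callables
# (embed_image, embed_text, vlm_generate_rationale) are hypothetical and
# must be supplied by the user; the paper's actual similarity measure,
# prompt format, and filtering rule may differ.
from dataclasses import dataclass

import numpy as np


@dataclass
class VQAExample:
    image_path: str
    question: str
    answer: str


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def multimodal_similarity(x, y, embed_image, embed_text) -> float:
    """Average of image-image and question-question similarity (an assumption;
    the abstract only says pairs are curated 'according to multi-modal similarity')."""
    img_sim = cosine(embed_image(x.image_path), embed_image(y.image_path))
    txt_sim = cosine(embed_text(x.question), embed_text(y.question))
    return 0.5 * (img_sim + txt_sim)


def curate_contrastive_pairs(examples, embed_image, embed_text, threshold=0.85):
    """Greedily pair visually similar examples with synonymous questions."""
    pairs, used = [], set()
    for i, x in enumerate(examples):
        if i in used:
            continue
        best_j, best_sim = None, threshold
        for j in range(i + 1, len(examples)):
            if j in used:
                continue
            sim = multimodal_similarity(x, examples[j], embed_image, embed_text)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((x, examples[best_j]))
            used.update({i, best_j})
    return pairs


def collect_rationales(pairs, vlm_generate_rationale):
    """STaR-style filtering: keep a rationale only if its final answer is correct."""
    dataset = []
    for x, y in pairs:
        for target, contrast in ((x, y), (y, x)):
            rationale, predicted = vlm_generate_rationale(target, contrast)
            if predicted.strip().lower() == target.answer.strip().lower():
                dataset.append({"image": target.image_path,
                                "question": target.question,
                                "rationale": rationale,
                                "answer": target.answer})
    return dataset  # examples like these would then feed supervised finetuning
```

In a STaR-style setup, only rationales whose final answer matches the ground truth are retained, which is how a self-generated dataset such as VisCoR-55K can stay reliable enough to use as finetuning supervision.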
Related papers
- Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy [75.66913260900726]
Reinforcement Learning with Verifiable Rewards has significantly advanced reasoning capabilities in Large Language Models. Existing paradigms, driven by text-centric outcome rewards, encourage models to bypass visual perception. We propose Thinking with Deltas, a framework driven by a Differential Visual Reasoning Policy.
arXiv Detail & Related papers (2026-01-11T08:25:34Z) - Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z) - Self-Rewarding Vision-Language Model via Reasoning Decomposition [49.784411666601905]
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and from language shortcuts. We introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts.
arXiv Detail & Related papers (2025-08-27T08:01:03Z) - VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs [18.349695067647012]
Visual Language Models excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
arXiv Detail & Related papers (2025-07-04T23:15:52Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification [22.871255950998016]
We introduce a novel framework for inference-time visual tokens scaling that enables MLLMs to perform verifier-guided reasoning over visual content. Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
arXiv Detail & Related papers (2025-06-08T17:38:49Z) - Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors. We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues [34.95077625513563]
We introduce VLM2-Bench, a benchmark designed to assess whether vision-language models can visually link matching cues. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap.
arXiv Detail & Related papers (2025-02-17T17:57:50Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
arXiv Detail & Related papers (2023-10-23T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.