MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
- URL: http://arxiv.org/abs/2506.22434v1
- Date: Fri, 27 Jun 2025 17:59:27 GMT
- Title: MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
- Authors: Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
- Abstract summary: Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images. We adapt rule-based reinforcement learning for Vision-Language Models (VLMs). Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
- Score: 72.81576836419373
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
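A minimal Python sketch of how the data construction and reward described in the abstract could look. The abstract specifies only that two augmented views of an image plus a similar distractor are compared under a same/different rule-based reward; the augmentation choices, answer format, and function names below are assumptions made for illustration, not the authors' released code.

```python
import random
import re
from PIL import Image
from torchvision import transforms

# Augmentations used to create two views of the same image (assumed; the
# abstract does not state which transforms are applied).
augment = transforms.Compose([
    transforms.RandomResizedCrop(448, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
])

def build_triplet(anchor: Image.Image, distractor: Image.Image):
    """Return (view_a, view_b, distractor): two augmented views of the anchor
    image plus a visually similar but distinct image."""
    return augment(anchor), augment(anchor), distractor

def make_pair_question(triplet):
    """Sample a 'same or different?' comparison with a rule-checkable label."""
    view_a, view_b, distractor = triplet
    if random.random() < 0.5:
        return (view_a, view_b), "same"        # two views of one image
    return (view_a, distractor), "different"   # anchor vs. similar image

def rule_based_reward(model_output: str, label: str) -> float:
    """Binary reward: 1 if the final answer matches the label, else 0.
    Assumes the model ends its reasoning with 'Answer: same/different'."""
    match = re.search(r"answer\s*:\s*(same|different)", model_output, re.I)
    return float(match is not None and match.group(1).lower() == label)
```

In a full pipeline this binary reward would drive a policy-gradient update on the VLM's sampled reasoning traces; the abstract says only "rule-based reinforcement learning" and leaves the specific RL algorithm unspecified.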
Related papers
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
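The ViCrit summary describes a verifiable proxy task: corrupt one detail in a human-written caption and reward the model for localizing it. Below is a toy Python sketch of that idea; the substitution table, answer format, and reward parsing are invented for illustration and are not ViCrit's actual construction.

```python
import random
import re

# Toy substitution table for injecting a subtle, checkable "hallucination"
# into a caption (assumed; ViCrit's real perturbations are built differently
# and operate on full caption paragraphs).
SWAPS = {"red": "blue", "dog": "cat", "left": "right", "two": "three"}

def inject_hallucination(caption: str):
    """Replace one swappable word; return the corrupted caption and the
    injected word the model must localize."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not candidates:
        return caption, None                    # nothing to corrupt
    i = random.choice(candidates)
    injected = SWAPS[words[i].lower()]
    words[i] = injected
    return " ".join(words), injected

def verifiable_reward(model_output: str, injected: str) -> float:
    """Reward 1 if the model names the injected word, else 0.
    Assumes the model answers with 'Hallucination: <word>'."""
    match = re.search(r"hallucination\s*:\s*(\w+)", model_output, re.I)
    return float(match is not None and match.group(1).lower() == injected)
```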
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning [41.59815187158526]
We train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs. Our model, named Visionary-R1, outperforms strong multimodal models on multiple visual reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T17:58:35Z) - The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights [26.85150689408895]
We show that existing multimodal mathematical models minimally leverage visual information. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. In testing leading models, their failure to detect subtle visual differences suggests limitations in current visual perception capabilities.
arXiv Detail & Related papers (2025-03-06T07:29:33Z) - Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed InstaManip. We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages. Our method surpasses previous few-shot image manipulation models by a notable margin.
arXiv Detail & Related papers (2024-12-02T01:19:21Z) - Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z) - Cross-Modal Contrastive Learning for Robust Reasoning in VQA [76.1596796687494]
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently.
Most reasoning models heavily rely on shortcuts learned from training data.
We propose a simple but effective cross-modal contrastive learning strategy to get rid of the shortcut reasoning.
arXiv Detail & Related papers (2022-11-21T05:32:24Z) - Visualizing and Understanding Contrastive Learning [22.553990823550784]
We design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images.
We also adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations.
arXiv Detail & Related papers (2022-06-20T13:01:46Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
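The VQA-GNN summary hinges on fusing a scene graph and a concept graph through a single super node representing the QA context. The Python sketch below illustrates that structural idea using networkx; which nodes the real model actually links to the super node, and the node/edge attributes used here, are assumptions for illustration only.

```python
import networkx as nx

def join_with_super_node(scene_graph: nx.Graph, concept_graph: nx.Graph,
                         qa_context: str) -> nx.Graph:
    """Connect an image scene graph and a concept graph through one super
    node representing the QA context (illustrative simplification)."""
    joint = nx.compose(scene_graph, concept_graph)   # union of both graphs
    super_node = "qa_context"
    joint.add_node(super_node, text=qa_context)
    # Link the super node to every node in either graph so that GNN message
    # passing can fuse the two knowledge sources; the actual model may
    # connect only QA-relevant nodes.
    for node in list(scene_graph.nodes) + list(concept_graph.nodes):
        joint.add_edge(super_node, node, relation="context")
    return joint
```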
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematic object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z) - Unsupervised Landmark Learning from Unpaired Data [117.81440795184587]
Recent attempts at unsupervised landmark learning leverage synthesized image pairs that are similar in appearance but differ in pose.
We propose a cross-image cycle consistency framework which applies the swapping-reconstruction strategy twice to obtain the final supervision.
Our proposed framework is shown to outperform strong baselines by a large margin.
arXiv Detail & Related papers (2020-06-29T13:57:20Z)