Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
- URL: http://arxiv.org/abs/2507.00748v1
- Date: Tue, 01 Jul 2025 13:48:57 GMT
- Title: Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
- Authors: Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yanbin Hao
- Abstract summary: Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions. We adopt a Reinforcement Learning based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks.
- Score: 28.95877614294155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently excelled at visual grounding in single-image scenarios with textual references. However, their performance degrades in real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving a +9.04% improvement on MIG-Bench and a +4.98% improvement on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.
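The abstract describes the rule-based RL stage only at a high level. As a rough illustration of the kind of verifiable reward such training typically relies on, the sketch below combines a format check with an IoU check against the annotated box on the correct image; the tag schema, answer format, reward weights, and 0.5 IoU threshold are all assumptions for illustration, not the paper's actual implementation.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(completion, gt_image_idx, gt_box, iou_threshold=0.5):
    """Rule-based reward: small format reward plus IoU-gated accuracy reward.

    Assumes the policy emits <think>...</think> reasoning followed by an
    answer such as <answer>image=2 box=[x1, y1, x2, y2]</answer>; this
    schema is illustrative, not the paper's exact output format.
    """
    # Format reward: both reasoning and answer tags must be present.
    format_ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                          completion, re.DOTALL)
    reward = 0.5 if format_ok else 0.0

    # Accuracy reward: parse the predicted image index and box.
    m = re.search(r"<answer>\s*image=(\d+)\s*box=\[([^\]]+)\]", completion)
    if m is None:
        return reward
    try:
        pred_box = [float(v) for v in m.group(2).split(",")]
    except ValueError:
        return reward
    # Full credit only for the right image and a sufficiently tight box.
    if int(m.group(1)) == gt_image_idx and len(pred_box) == 4 \
            and iou(pred_box, gt_box) >= iou_threshold:
        reward += 1.0
    return reward
```

A GRPO-style trainer would score each sampled completion with such a function and use group-normalized rewards as advantages; the format term keeps outputs parseable while the IoU term rewards grounding accuracy.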
Related papers
- WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z)
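The summary above does not spell out the merging step itself. A minimal sketch of decay-free checkpoint merging, assuming checkpoints saved periodically during the constant-LR stable phase are plain state dicts; the uniform weights are a placeholder, since choosing a weighting that emulates a particular decay schedule is the paper's actual contribution:

```python
import torch

def merge_checkpoints(checkpoint_paths, weights=None):
    """Weighted average of model state dicts from the stable phase.

    Uniform weighting is a placeholder; per WSM, different weightings
    emulate different learning-rate decay schedules.
    """
    if weights is None:
        weights = [1.0 / len(checkpoint_paths)] * len(checkpoint_paths)
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = None
    for path, w in zip(checkpoint_paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged  # load into the model with model.load_state_dict(merged)
```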
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning [43.8114307203968]
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO). MGPO enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images.
arXiv Detail & Related papers (2025-07-08T12:05:05Z)
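The multi-turn cropping loop MGPO trains can be sketched as follows; `model.predict_region` and `model.answer` are hypothetical interfaces standing in for the actual LMM calls, and the real method optimizes this loop with reinforcement learning rather than running it zero-shot:

```python
from PIL import Image

def multi_turn_grounding(model, image_path, question, max_turns=3):
    """Iteratively zoom into model-predicted regions of a high-res image."""
    image = Image.open(image_path)
    for _ in range(max_turns):
        # Ask the model which region of the current view is relevant.
        box = model.predict_region(image, question)  # (x1, y1, x2, y2) or None
        if box is None:
            break  # the model can answer from the current view
        # Crop the predicted sub-image and continue from the zoomed view.
        image = image.crop(box)
    return model.answer(image, question)
```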
- PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [50.21619363035618]
We propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks. We introduce permutation of image sequences to simulate varied positional relationships and explore more spatial and positional diversity. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
arXiv Detail & Related papers (2025-06-17T18:25:56Z)
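The permutation idea can be sketched for an interleaved sample whose text references images by position; the `<image_i>` placeholder convention below is an assumption, not PeRL's exact data format:

```python
import random

def permute_images(images, text):
    """Shuffle the image list and remap <image_i> placeholders to match,
    so the text still refers to the same pictures at their new positions."""
    order = list(range(len(images)))
    random.shuffle(order)  # order[j] = original index now shown at slot j
    remap = {orig: new for new, orig in enumerate(order)}
    permuted = [images[i] for i in order]
    # Two-pass rewrite avoids clobbering indices that swap with each other.
    for orig, new in remap.items():
        text = text.replace(f"<image_{orig}>", f"<tmp_{new}>")
    for new in remap.values():
        text = text.replace(f"<tmp_{new}>", f"<image_{new}>")
    return permuted, text
```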
- Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward [87.06604760273372]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
- Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying [7.9925771591348065]
The core contrastive learning paradigm has remained largely unchanged from CLIP-style models to MLLMs. In this work, we conduct a detailed analysis of the gradients of the InfoNCE loss with respect to the query, positive, and negative samples. We propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings.
arXiv Detail & Related papers (2025-05-28T11:18:19Z)
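One way to realize the amplification idea is to up-weight the logits of the hardest in-batch negatives, which scales their gradients in the same proportion; the rule below is a guess at the spirit of the method rather than its exact formulation, and `amp` and `top_k` are assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(query, positives, temperature=0.07, amp=2.0, top_k=5):
    """InfoNCE over in-batch negatives with hard-negative logits amplified.

    query, positives: (B, D) L2-normalized embeddings, row i paired with row i.
    """
    logits = query @ positives.t() / temperature          # (B, B) similarities
    B = logits.size(0)
    labels = torch.arange(B, device=logits.device)
    # Hardest negatives = largest off-diagonal similarities in each row.
    diag = torch.eye(B, dtype=torch.bool, device=logits.device)
    hard_idx = logits.masked_fill(diag, float("-inf")) \
                     .topk(k=min(top_k, B - 1), dim=1).indices
    # Scaling a negative's logit by `amp` scales its gradient by `amp` too,
    # which captures the "explicit gradient amplifying" idea in spirit.
    scale = torch.ones_like(logits)
    scale.scatter_(1, hard_idx, amp)  # positives on the diagonal stay at 1.0
    return F.cross_entropy(logits * scale, labels)
```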
- DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning [33.574626079343936]
We introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset.
arXiv Detail & Related papers (2025-05-26T17:20:17Z)
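The upper-level step can be illustrated with a first-order proxy: raise a domain's weight when its gradient aligns with the meta-set gradient. This alignment rule is a simplification of ours, not DreamPRM's actual bi-level update, and `prm.loss(batch) -> scalar` is a hypothetical interface:

```python
import torch

def update_domain_weights(prm, domain_batches, meta_batch, log_w, lr=0.1):
    """First-order proxy for the upper-level step of a bi-level loop."""
    params = [p for p in prm.parameters() if p.requires_grad]
    meta_grads = torch.autograd.grad(prm.loss(meta_batch), params)
    for d, batch in enumerate(domain_batches):
        grads = torch.autograd.grad(prm.loss(batch), params)
        # Dot product between this domain's gradient and the meta gradient.
        align = sum((g * mg).sum() for g, mg in zip(grads, meta_grads))
        with torch.no_grad():
            log_w[d] += lr * align.sign()  # up-weight helpful domains
    # Softmax weights feed the lower-level (weighted fine-tuning) objective.
    return torch.softmax(log_w, dim=0)
```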
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning [30.073631823776825]
We propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding. We first construct a high-quality Chain-of-Thought grounding dataset, annotated with detailed reasoning chains. We then perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities.
arXiv Detail & Related papers (2025-05-20T11:40:43Z)
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs). We show that SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z)
- Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [54.22256089592864]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
arXiv Detail & Related papers (2025-04-01T13:13:43Z)
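The repeated-sampling-then-voting strategy with multiple models is simple to sketch; each entry in `models` is assumed to be a callable `prompt -> answer` (for example, wrapping an API call with temperature above zero), and the even budget split is an assumption:

```python
from collections import Counter

def multi_model_vote(models, prompt, total_samples=16):
    """Sample repeatedly from several models, then take a majority vote."""
    per_model = max(1, total_samples // len(models))
    answers = []
    for model in models:
        for _ in range(per_model):
            answers.append(model(prompt))
    # Majority vote over all answers; ties resolved by first-seen order.
    return Counter(answers).most_common(1)[0][0]
```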
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences of its use.