Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
- URL: http://arxiv.org/abs/2507.00748v1
- Date: Tue, 01 Jul 2025 13:48:57 GMT
- Title: Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
- Authors: Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yanbin Hao
- Abstract summary: Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions. We adopt a Reinforcement Learning based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks.
- Score: 28.95877614294155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently excelled at visual grounding in single-image scenarios with textual references. However, their performance degrades in real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving a +9.04% improvement on MIG-Bench and a +4.98% improvement on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.
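The abstract describes the rule-based RL stage only at a high level. As a rough illustration of the kind of verifiable reward such training typically relies on, the sketch below combines a format check with an IoU check against the annotated box on the correct image; the tag schema, answer format, reward weights, and 0.5 IoU threshold are all assumptions for illustration, not the paper's actual implementation.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(completion, gt_image_idx, gt_box, iou_threshold=0.5):
    """Rule-based reward: small format reward plus IoU-gated accuracy reward.

    Assumes the policy emits <think>...</think> reasoning followed by an
    answer such as <answer>image=2 box=[x1, y1, x2, y2]</answer>; this
    schema is illustrative, not the paper's exact output format.
    """
    # Format reward: both reasoning and answer tags must be present.
    format_ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                          completion, re.DOTALL)
    reward = 0.5 if format_ok else 0.0

    # Accuracy reward: parse the predicted image index and box.
    m = re.search(r"<answer>\s*image=(\d+)\s*box=\[([^\]]+)\]", completion)
    if m is None:
        return reward
    try:
        pred_box = [float(v) for v in m.group(2).split(",")]
    except ValueError:
        return reward
    # Full credit only for the right image and a sufficiently tight box.
    if int(m.group(1)) == gt_image_idx and len(pred_box) == 4 \
            and iou(pred_box, gt_box) >= iou_threshold:
        reward += 1.0
    return reward
```

A GRPO-style trainer would score each sampled completion with such a function and use group-normalized rewards as advantages; the format term keeps outputs parseable while the IoU term rewards grounding accuracy.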
Related papers
- WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z)
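The summary above does not spell out the merging step itself. A minimal sketch of decay-free checkpoint merging, assuming checkpoints saved periodically during the constant-LR stable phase are plain state dicts; the uniform weights are a placeholder, since choosing a weighting that emulates a particular decay schedule is the paper's actual contribution:

```python
import torch

def merge_checkpoints(checkpoint_paths, weights=None):
    """Weighted average of model state dicts from the stable phase.

    Uniform weighting is a placeholder; per WSM, different weightings
    emulate different learning-rate decay schedules.
    """
    if weights is None:
        weights = [1.0 / len(checkpoint_paths)] * len(checkpoint_paths)
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = None
    for path, w in zip(checkpoint_paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged  # load into the model with model.load_state_dict(merged)
```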
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning [43.8114307203968]
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO). MGPO enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images.
arXiv Detail & Related papers (2025-07-08T12:05:05Z)
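The multi-turn cropping loop MGPO trains can be sketched as follows; `model.predict_region` and `model.answer` are hypothetical interfaces standing in for the actual LMM calls, and the real method optimizes this loop with reinforcement learning rather than running it zero-shot:

```python
from PIL import Image

def multi_turn_grounding(model, image_path, question, max_turns=3):
    """Iteratively zoom into model-predicted regions of a high-res image."""
    image = Image.open(image_path)
    for _ in range(max_turns):
        # Ask the model which region of the current view is relevant.
        box = model.predict_region(image, question)  # (x1, y1, x2, y2) or None
        if box is None:
            break  # the model can answer from the current view
        # Crop the predicted sub-image and continue from the zoomed view.
        image = image.crop(box)
    return model.answer(image, question)
```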
- PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [50.21619363035618]
We propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks. We introduce permutation of image sequences to simulate varied positional relationships and explore more spatial and positional diversity. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
arXiv Detail & Related papers (2025-06-17T18:25:56Z)
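The permutation idea can be sketched for an interleaved sample whose text references images by position; the `<image_i>` placeholder convention below is an assumption, not PeRL's exact data format:

```python
import random

def permute_images(images, text):
    """Shuffle the image list and remap <image_i> placeholders to match,
    so the text still refers to the same pictures at their new positions."""
    order = list(range(len(images)))
    random.shuffle(order)  # order[j] = original index now shown at slot j
    remap = {orig: new for new, orig in enumerate(order)}
    permuted = [images[i] for i in order]
    # Two-pass rewrite avoids clobbering indices that swap with each other.
    for orig, new in remap.items():
        text = text.replace(f"<image_{orig}>", f"<tmp_{new}>")
    for new in remap.values():
        text = text.replace(f"<tmp_{new}>", f"<image_{new}>")
    return permuted, text
```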
- Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward [87.06604760273372]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
- Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying [7.9925771591348065]
The core contrastive learning paradigm has remained largely unchanged from CLIP-style models to MLLMs. In this work, we conduct a detailed analysis of the gradients of the InfoNCE loss with respect to the query, positive, and negative samples. We propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings.
arXiv Detail & Related papers (2025-05-28T11:18:19Z)
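One way to realize the amplification idea is to up-weight the logits of the hardest in-batch negatives, which scales their gradients in the same proportion; the rule below is a guess at the spirit of the method rather than its exact formulation, and `amp` and `top_k` are assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(query, positives, temperature=0.07, amp=2.0, top_k=5):
    """InfoNCE over in-batch negatives with hard-negative logits amplified.

    query, positives: (B, D) L2-normalized embeddings, row i paired with row i.
    """
    logits = query @ positives.t() / temperature          # (B, B) similarities
    B = logits.size(0)
    labels = torch.arange(B, device=logits.device)
    # Hardest negatives = largest off-diagonal similarities in each row.
    diag = torch.eye(B, dtype=torch.bool, device=logits.device)
    hard_idx = logits.masked_fill(diag, float("-inf")) \
                     .topk(k=min(top_k, B - 1), dim=1).indices
    # Scaling a negative's logit by `amp` scales its gradient by `amp` too,
    # which captures the "explicit gradient amplifying" idea in spirit.
    scale = torch.ones_like(logits)
    scale.scatter_(1, hard_idx, amp)  # positives on the diagonal stay at 1.0
    return F.cross_entropy(logits * scale, labels)
```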
- DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning [33.574626079343936]
We introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset.
arXiv Detail & Related papers (2025-05-26T17:20:17Z)
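The upper-level step can be illustrated with a first-order proxy: raise a domain's weight when its gradient aligns with the meta-set gradient. This alignment rule is a simplification of ours, not DreamPRM's actual bi-level update, and `prm.loss(batch) -> scalar` is a hypothetical interface:

```python
import torch

def update_domain_weights(prm, domain_batches, meta_batch, log_w, lr=0.1):
    """First-order proxy for the upper-level step of a bi-level loop."""
    params = [p for p in prm.parameters() if p.requires_grad]
    meta_grads = torch.autograd.grad(prm.loss(meta_batch), params)
    for d, batch in enumerate(domain_batches):
        grads = torch.autograd.grad(prm.loss(batch), params)
        # Dot product between this domain's gradient and the meta gradient.
        align = sum((g * mg).sum() for g, mg in zip(grads, meta_grads))
        with torch.no_grad():
            log_w[d] += lr * align.sign()  # up-weight helpful domains
    # Softmax weights feed the lower-level (weighted fine-tuning) objective.
    return torch.softmax(log_w, dim=0)
```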
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning [30.073631823776825]
We propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding. We first construct a high-quality Chain-of-Thought grounding dataset, annotated with detailed reasoning chains. We then perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities.
arXiv Detail & Related papers (2025-05-20T11:40:43Z)
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs). We show that SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z)
- Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [54.22256089592864]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
arXiv Detail & Related papers (2025-04-01T13:13:43Z)
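The repeated-sampling-then-voting strategy with multiple models is simple to sketch; each entry in `models` is assumed to be a callable `prompt -> answer` (for example, wrapping an API call with temperature above zero), and the even budget split is an assumption:

```python
from collections import Counter

def multi_model_vote(models, prompt, total_samples=16):
    """Sample repeatedly from several models, then take a majority vote."""
    per_model = max(1, total_samples // len(models))
    answers = []
    for model in models:
        for _ in range(per_model):
            answers.append(model(prompt))
    # Majority vote over all answers; ties resolved by first-seen order.
    return Counter(answers).most_common(1)[0][0]
```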
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences of its use.