PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
- URL: http://arxiv.org/abs/2506.14907v1
- Date: Tue, 17 Jun 2025 18:25:56 GMT
- Title: PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
- Authors: Yizhen Zhang, Yang Ding, Shuoshuo Zhang, Xinchen Zhang, Haoling Li, Zhong-zhi Li, Peijie Wang, Jie Wu, Lei Ji, Yelong Shen, Yujiu Yang, Yeyun Gong
- Abstract summary: We propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks. We introduce permutation of image sequences to simulate varied positional relationships and explore greater spatial and positional diversity. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
- Score: 50.21619363035618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) on multimodal reasoning tasks. However, most existing multimodal RL approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, together with a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships, exploring greater spatial and positional diversity. Furthermore, we design a rollout filtering mechanism that resamples the trajectories contributing most to learning optimal behaviors, so that the learned policies are exploited effectively. We evaluate our model on 5 widely used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.
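For illustration, below is a minimal Python sketch of the two mechanisms named in the abstract: permutation of the image sequence and advantage-style rollout filtering. The data fields (`images`, `question`, `reward`) and the filtering criterion are assumptions made for this sketch, not the authors' implementation.

```python
import random
from typing import Dict, List


def permute_images(example: Dict) -> Dict:
    """Shuffle the order of images in an interleaved example so the policy sees
    varied positional relationships. Field names are hypothetical."""
    perm = list(range(len(example["images"])))
    random.shuffle(perm)
    return {
        "images": [example["images"][i] for i in perm],
        "question": example["question"],
        "permutation": perm,  # recorded so position-dependent answers can still be scored
    }


def filter_rollouts(rollouts: List[Dict], keep_ratio: float = 0.5) -> List[Dict]:
    """Resample rollouts, keeping those whose reward deviates most from the group
    mean, i.e. the trajectories that carry the strongest learning signal."""
    mean_reward = sum(r["reward"] for r in rollouts) / len(rollouts)
    ranked = sorted(rollouts, key=lambda r: abs(r["reward"] - mean_reward), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```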
Related papers
- Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning [28.111812077758845]
Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications that involve complex multi-image compositions and multi-modal instructions. We adopt a Reinforcement Learning based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks.
arXiv Detail & Related papers (2025-07-01T13:48:57Z)
- Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward [87.06604760273372]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training examples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
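As a rough illustration of the reward described in the ViaRL summary above, a frame selector can be scored by whether a downstream model answers correctly from the selected frames. The `selector` and `vqa_model` interfaces here are hypothetical, not the authors' API.

```python
def frame_selection_reward(selector, vqa_model, frames, question, ground_truth, k=8):
    """Hypothetical interfaces: selector.pick returns k frames, vqa_model.answer
    returns a string. The reward is the downstream answer accuracy (0 or 1)."""
    selected = selector.pick(frames, question, k=k)
    prediction = vqa_model.answer(selected, question)
    return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
```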
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning [30.073631823776825]
We propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding. We first construct a high-quality Chain-of-Thought grounding dataset, annotated with detailed reasoning chains. We then perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities.
arXiv Detail & Related papers (2025-05-20T11:40:43Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval [13.59418209417664]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query without training samples. We propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-Thought (CoT) and Multi-scale Reasoning.
arXiv Detail & Related papers (2025-02-28T08:12:23Z)
- Deep Multimodal Collaborative Learning for Polyp Re-Identification [4.4028428688691905]
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras.
Traditional object ReID methods that directly adopt CNN models trained on the ImageNet dataset produce unsatisfactory retrieval performance.
We propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification.
arXiv Detail & Related papers (2024-08-12T04:05:19Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
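A rough sketch of what patch-level mutual attention between support and query samples could look like is given below. This is an illustrative cross-attention under assumed tensor shapes, not the paper's exact formulation.

```python
import torch


def mutual_attention(support_patches: torch.Tensor, query_patches: torch.Tensor):
    """support_patches: [Ns, D], query_patches: [Nq, D]. Each set attends to the
    other so task-relevant patch features are emphasised before comparison."""
    scale = support_patches.shape[-1] ** 0.5
    q_to_s = torch.softmax(query_patches @ support_patches.T / scale, dim=-1)  # [Nq, Ns]
    s_to_q = torch.softmax(support_patches @ query_patches.T / scale, dim=-1)  # [Ns, Nq]
    query_ctx = q_to_s @ support_patches    # query features expressed via support patches
    support_ctx = s_to_q @ query_patches    # support features expressed via query patches
    return support_ctx, query_ctx
```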
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Mask-based Latent Reconstruction for Reinforcement Learning [58.43247393611453]
Mask-based Latent Reconstruction (MLR) is proposed to predict the complete state representations in the latent space from the observations with spatially and temporally masked pixels.
Extensive experiments show that our MLR significantly improves the sample efficiency in deep reinforcement learning.
arXiv Detail & Related papers (2022-01-28T13:07:11Z)
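For illustration, a minimal sketch of a masked latent reconstruction objective along the lines described in the MLR summary above. The encoder, predictor, and target-encoder interfaces are assumptions for this sketch, not the paper's code.

```python
import torch
import torch.nn.functional as F


def masked_latent_reconstruction_loss(encoder, predictor, target_encoder, obs, mask):
    """obs: a batch of observations; mask: a binary tensor that zeroes out
    spatially/temporally masked pixels. The predictor must recover the latents
    of the complete observation from the masked view."""
    pred = predictor(encoder(obs * mask))   # latents predicted from masked pixels
    with torch.no_grad():
        target = target_encoder(obs)        # latents of the unmasked observation
    return F.mse_loss(pred, target)
```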
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)