R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
- URL: http://arxiv.org/abs/2505.16673v1
- Date: Thu, 22 May 2025 13:39:32 GMT
- Title: R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
- Authors: Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
- Abstract summary: We aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL). We propose Share-GRPO, a novel RL approach that tackles the sparse reward and advantage vanishing issues by exploring and sharing diverse reasoning trajectories over an expanded question space. In addition, Share-GRPO also shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants.
- Score: 91.25793883692036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over this expanded space, and shares the discovered trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
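The hierarchical, two-level advantage estimation described above can be pictured with a short sketch. Everything below is an illustrative guess at the shape of the computation, not the authors' released code; the function name, the per-level standardization, and the 0.5/0.5 combination weights are all assumptions:

```python
import numpy as np

def share_grpo_advantages(rewards_per_variant, eps=1e-6):
    """Hierarchical advantage estimation (illustrative sketch, not official code).

    rewards_per_variant: list of 1-D arrays, one per question variant;
    each array holds the scalar rewards of that variant's rollouts.
    Returns a list of advantage arrays with the same shapes.
    """
    # Within-variant level: standardize rewards inside each variant group.
    within = [(r - r.mean()) / (r.std() + eps) for r in rewards_per_variant]

    # Across-variant level: standardize variant means over the expanded
    # question space, so each variant also receives a shared group signal.
    means = np.array([r.mean() for r in rewards_per_variant])
    across = (means - means.mean()) / (means.std() + eps)

    # Combine both levels; the 0.5/0.5 weighting is an arbitrary choice
    # for illustration only.
    return [0.5 * w + 0.5 * a for w, a in zip(within, across)]

# Example: three variants of one question, each with 4 sampled rollouts.
rewards = [np.array([1.0, 0.0, 1.0, 1.0]),   # original question
           np.array([0.0, 0.0, 1.0, 0.0]),   # transformed variant 1
           np.array([1.0, 1.0, 1.0, 0.0])]   # transformed variant 2
for adv in share_grpo_advantages(rewards):
    print(adv.round(3))
```

Sharing the across-variant term means a trajectory sampled for one variant still contributes signal to its siblings, which is one way the sparse-reward and advantage-vanishing issues could be softened.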
Related papers
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Multimodal Mathematical Reasoning with Diverse Solving Perspective [65.07953438724105]
We introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair. We propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions.
arXiv Detail & Related papers (2025-07-03T17:07:20Z)
- Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs [112.40801692473723]
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). We introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Our method achieves significant gains on the Pass@K metric, even when evaluated with extremely large K values.
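The summary pins the method down to a single change: add an entropy-based term to the advantage. A minimal sketch of what that one line could look like (the coefficient `beta`, the token-level granularity, and the `detach` are assumptions, not the paper's exact formulation):

```python
import torch

def entropy_augmented_advantage(advantages, logits, beta=0.01):
    """Sketch of an entropy-shaped advantage (assumed form).

    advantages: (batch, seq) per-token advantages from GRPO/PPO.
    logits:     (batch, seq, vocab) policy logits at each generated token.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq)
    # The "one line": boost advantages where the policy is uncertain,
    # nudging it to keep exploring during RL training.
    return advantages + beta * entropy.detach()
```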
arXiv Detail & Related papers (2025-06-17T17:54:03Z)
- Diversity-Aware Policy Optimization for Large Language Model Reasoning [30.460540027658173]
We investigate the impact of diversity in RL-based training for large language models. We propose a novel diversity-aware policy optimization method. Our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-05-29T13:27:44Z)
- Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning [60.84901522792042]
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs). We propose R1, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. R1 can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
arXiv Detail & Related papers (2025-05-28T08:17:57Z)
- LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization [30.95342819013663]
Large language models (LLMs) have demonstrated impressive capabilities in reasoning. Recent research focuses on integrating reasoning capabilities into retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches. We propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process rewards and outcome-based rewards in current RL methods for RAG.
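A plausible reading of the process-and-outcome hybridization is a weighted blend of per-step scores and a final correctness score; the sketch below uses placeholder weights and reward sources, not LeTS's actual formulation:

```python
def hybrid_reward(step_rewards, outcome_reward, alpha=0.5):
    """Sketch: blend stepwise process rewards with a final outcome reward.

    step_rewards:   list of floats, one per think/search step (process signal).
    outcome_reward: float, e.g. 1.0 if the final answer is correct else 0.0.
    alpha:          mixing weight; 0.5 is an arbitrary placeholder.
    """
    process = sum(step_rewards) / max(len(step_rewards), 1)
    return alpha * process + (1.0 - alpha) * outcome_reward

# Example: two good retrieval/reasoning steps, one poor one, correct answer.
print(hybrid_reward([1.0, 1.0, 0.0], outcome_reward=1.0))  # ~0.833
```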
arXiv Detail & Related papers (2025-05-23T04:04:05Z)
- General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering diverse domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
- Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models [22.796496516709514]
This paper provides a systematic review of recent advances in reinforcement learning (RL)-based reasoning for Multimodal Large Language Models (MLLMs). We highlight two main RL paradigms, value-model-free and value-model-based methods, and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. We provide an extensive overview of benchmark datasets, evaluation protocols, and current limitations, and propose future research directions to address challenges such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints.
arXiv Detail & Related papers (2025-04-30T03:14:28Z)
- SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM [18.275547804539016]
Two-Staged history-Resampling Policy Optimization (SRPO) surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. We introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples.
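The summary does not say what counts as an ineffective sample. One common failure mode in group-relative RL is a prompt whose rollouts all receive the same reward, which yields zero advantage for every trajectory; the sketch below assumes that criterion purely for illustration:

```python
def history_resample(batch):
    """Sketch of History Resampling (HR) under an assumed criterion.

    batch: list of dicts like {"prompt": str, "rewards": [float, ...]}.
    Keeps only prompts whose rollout rewards disagree, since a group with
    identical rewards produces zero group-relative advantage and thus no
    gradient signal.
    """
    return [ex for ex in batch if len(set(ex["rewards"])) > 1]

batch = [{"prompt": "q1", "rewards": [1.0, 1.0, 1.0]},   # uninformative: dropped
         {"prompt": "q2", "rewards": [0.0, 1.0, 0.0]}]   # informative: kept
print([ex["prompt"] for ex in history_resample(batch)])   # ['q2']
```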
arXiv Detail & Related papers (2025-04-19T13:06:03Z)
- OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning [29.053899071144976]
We propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks. Specifically, we introduce Group Relative Policy Optimization with a dynamic Kullback-Leibler strategy (GRPO-D). GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation.
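The dynamic Kullback-Leibler strategy is not spelled out here; a simple guess is a KL penalty weight annealed over training, as in the sketch below (the linear schedule and endpoint values are assumptions, not GRPO-D's actual schedule):

```python
def dynamic_kl_coef(step, total_steps, kl_max=0.1, kl_min=0.01):
    """Sketch: linearly anneal the KL penalty weight over training.

    A larger early weight keeps the policy near the reference model while
    rewards are noisy; a smaller late weight permits more exploration.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return kl_max + frac * (kl_min - kl_max)

# The coefficient would scale the KL term in a GRPO-style objective, e.g.:
# loss = -advantage_term + dynamic_kl_coef(step, total) * kl_to_reference
print(dynamic_kl_coef(0, 1000), dynamic_kl_coef(1000, 1000))  # 0.1 0.01
```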
arXiv Detail & Related papers (2025-03-20T12:22:18Z)
- Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG [79.61269381878547]
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). This paper shifts away from prior dedicated approaches and revisits the most fundamental idea for HR perception: enhancing the long-context capability of MLLMs. We propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context.
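A minimal sketch of the retrieval step as the summary describes it: score crops of the HR image against the query and keep the top matches in their original spatial order. The embedding inputs are placeholders that would come from a real vision-language encoder; none of this is RAP's actual implementation:

```python
import numpy as np

def retrieve_crops(crop_embs, query_emb, positions, k=4):
    """Sketch: pick the k crops most similar to the text query.

    crop_embs: (n_crops, d) unit-normalized crop embeddings (placeholder;
               assume a vision encoder applied to a grid of HR-image crops).
    query_emb: (d,) unit-normalized query embedding.
    positions: list of (row, col) grid coordinates per crop.
    Returns the selected indices sorted by grid position, so the fused
    sequence of crops preserves the original spatial layout.
    """
    scores = crop_embs @ query_emb                  # cosine similarity
    top = np.argsort(-scores)[:k]                   # most relevant crops
    return sorted(top, key=lambda i: positions[i])  # keep spatial order
```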
arXiv Detail & Related papers (2025-03-03T06:40:21Z)
- Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning [43.123290672073814]
We deal with systems that generate summaries from document(s) based on a query.
Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, we use an RL-based approach for this task.
We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity.
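As an illustration of turning an overlap metric into a scalar RL reward, here is a self-contained ROUGE-1 F1 sketch (a simplified unigram-overlap score, not the official ROUGE toolkit the paper would have used):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1 between two whitespace-tokenized strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Used as the scalar reward for a sampled summary in policy-gradient training:
print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.833
```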
arXiv Detail & Related papers (2023-11-29T10:38:16Z)