Diversity-Aware Policy Optimization for Large Language Model Reasoning
- URL: http://arxiv.org/abs/2505.23433v1
- Date: Thu, 29 May 2025 13:27:44 GMT
- Title: Diversity-Aware Policy Optimization for Large Language Model Reasoning
- Authors: Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, Kay Chen Tan
- Abstract summary: We investigate the impact of diversity in RL-based training for large language models. We propose a novel diversity-aware policy optimization method. Our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks.
- Score: 30.460540027658173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
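The recipe in the abstract (a token-level diversity term that is applied only to positive samples) can be illustrated with a rough sketch. This is not the authors' formulation, which is not given on this page: the sketch assumes per-token Shannon entropy as a stand-in diversity measure, treats any sample with reward above zero as "positive", and uses hypothetical names (`token_entropy`, `diversity_bonus`, `shaped_objective`, `coeff`) throughout.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def diversity_bonus(samples, coeff=0.01):
    """Scaled mean per-token entropy, averaged over positive samples only.

    `samples` is a list of (reward, token_distributions) pairs, where
    token_distributions holds one probability vector per generated token.
    Assumption: a sample counts as "positive" when its reward is > 0.
    """
    per_sample = []
    for reward, dists in samples:
        if reward <= 0 or not dists:
            continue  # negative or empty samples receive no diversity term
        per_sample.append(sum(token_entropy(d) for d in dists) / len(dists))
    if not per_sample:
        return 0.0
    return coeff * sum(per_sample) / len(per_sample)


def shaped_objective(policy_objective, samples, coeff=0.01):
    """Add the (assumed) diversity bonus to an existing RL objective value."""
    return policy_objective + diversity_bonus(samples, coeff)


if __name__ == "__main__":
    batch = [
        (1.0, [[0.5, 0.5], [1.0, 0.0]]),     # correct answer: contributes
        (0.0, [[0.25, 0.25, 0.25, 0.25]]),   # incorrect answer: ignored
    ]
    print(shaped_objective(1.0, batch, coeff=0.1))
```

Restricting the bonus to positive samples means that, in this sketch, extra randomness is only encouraged on trajectories that already reach a correct answer, so the diversity term does not pull against the reward signal.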
Related papers
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards [50.21528417884747]
We introduce Omni-Thinker, a unified reinforcement learning framework that enhances large language model (LLM) performance across diverse tasks. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging.
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
- Multimodal Mathematical Reasoning with Diverse Solving Perspective [65.07953438724105]
We introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair. We propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions.
arXiv Detail & Related papers (2025-07-03T17:07:20Z)
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning [17.459985667824807]
Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. In this paper, we show how to achieve general-purpose visual-language reasoning through reinforcement learning.
arXiv Detail & Related papers (2025-06-09T16:20:54Z)
- R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO [91.25793883692036]
We aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL). We propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. In addition, Share-GRPO shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants.
arXiv Detail & Related papers (2025-05-22T13:39:32Z)
- Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components. DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components. Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z)
- Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models [22.796496516709514]
This paper provides a systematic review of recent advances in reinforcement learning (RL)-based reasoning for Multimodal Large Language Models (MLLMs). We highlight two main RL paradigms, value-model-free and value-model-based methods, and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. We provide an extensive overview of benchmark datasets, evaluation protocols, and current limitations, and propose future research directions to address challenges such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints.
arXiv Detail & Related papers (2025-04-30T03:14:28Z)
- Pareto Set Learning for Multi-Objective Reinforcement Learning [19.720934024901542]
We propose a decomposition-based framework for Multi-Objective RL (MORL). PSL-MORL harnesses the generation capability of a hypernetwork to produce the parameters of the policy network for each decomposition weight. We show that PSL-MORL significantly outperforms state-of-the-art MORL methods in the hypervolume and sparsity indicators.
arXiv Detail & Related papers (2025-01-12T10:43:05Z)
- MALT: Improving Reasoning with Multi-Agent LLM Training [66.9481561915524]
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z)
- Guiding Reinforcement Learning Using Uncertainty-Aware Large Language Models [1.2233495442213964]
Large Language Models (LLMs) offer a promising alternative to mitigate RL sample inefficiency and potentially replace human trainers. We address this limitation with a calibrated guidance system that uses Monte Carlo Dropout to enhance LLM advice reliability. We also develop a novel RL policy shaping method based on dynamic model average entropy to adjust the LLM's influence on RL policies according to guidance uncertainty.
arXiv Detail & Related papers (2024-11-15T22:00:29Z)
- Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences. Existing methods work by emulating the preferences at the single decision (turn) level. We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
arXiv Detail & Related papers (2024-05-23T14:53:54Z)
- MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Variational Empowerment as Representation Learning for Goal-Based Reinforcement Learning [114.07623388322048]
We discuss how the standard goal-conditioned RL (GCRL) is encapsulated by the objective of variational empowerment. Our work lays a novel foundation from which to evaluate, analyze, and develop representation learning techniques in goal-based RL.
arXiv Detail & Related papers (2021-06-02T18:12:26Z)
- Provable Multi-Objective Reinforcement Learning with Generative Models [98.19879408649848]
We study the problem of single policy MORL, which learns an optimal policy given the preference of objectives. Existing methods require strong assumptions such as exact knowledge of the multi-objective decision process. We propose a new algorithm called model-based envelop value iteration (EVI), which generalizes the enveloped multi-objective $Q$-learning algorithm.
arXiv Detail & Related papers (2020-11-19T22:35:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.