GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
- URL: http://arxiv.org/abs/2507.10628v2
- Date: Wed, 16 Jul 2025 15:30:11 GMT
- Title: GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
- Authors: Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs). This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
- Score: 15.43938821214447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model's reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.
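The difficulty-aware switching described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how difficulty is detected or what form the guidance takes, so everything below is an assumption made for illustration: a problem is treated as beyond the model's reach when every sampled rollout receives zero reward, and guidance is a prefix of a reference solution appended to the prompt. The names `guided_prompt`, `select_prompt`, and `hint_ratio` are hypothetical and do not come from the paper.

```python
from typing import List


def guided_prompt(question: str, solution: str, hint_ratio: float) -> str:
    """Append the first `hint_ratio` fraction of the reference solution as a hint.

    Assumption: guidance is a partial solution prefix; the abstract only says
    "adaptive prompt refinement" without fixing the form.
    """
    words = solution.split()
    k = int(len(words) * hint_ratio)
    if k == 0:
        return question
    return f"{question}\nHint: {' '.join(words[:k])}"


def select_prompt(question: str, solution: str, rewards: List[float],
                  hint_ratio: float = 0.5) -> str:
    """Return the plain question, or a guided one if all rollouts failed.

    Assumption: "too hard" means every rollout in the sampled group got zero
    reward, which is the sparse-reward case the abstract describes.
    """
    too_hard = len(rewards) > 0 and all(r == 0 for r in rewards)
    if too_hard:
        return guided_prompt(question, solution, hint_ratio)
    return question


if __name__ == "__main__":
    # All rollouts failed: the prompt is refined with a partial-solution hint.
    print(select_prompt("Solve x + 2 = 5.", "Subtract 2 from both sides", [0, 0, 0]))
    # At least one rollout succeeded: the original prompt is kept for plain RL.
    print(select_prompt("Solve x + 2 = 5.", "Subtract 2 from both sides", [0, 1, 0]))
```

In this sketch, problems with at least one successful rollout keep the unmodified prompt, so ordinary exploration-based updates apply, while fully failed problems are nudged toward the reference solution. That mirrors the balance between imitation and exploration the abstract describes, without claiming to reproduce the authors' actual procedure.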
Related papers
- Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement [21.073482007189504]
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks. Reinforcement learning under verifiable rewards (RLVR) is emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.
arXiv Detail & Related papers (2026-01-31T16:51:50Z) - HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness [49.72591739116668]
Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. We propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework.
arXiv Detail & Related papers (2025-10-10T13:42:03Z) - CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning [10.544900224321415]
We propose CLPO, a novel algorithm that creates a dynamic pedagogical feedback loop within the policy optimization process. Curriculum-guided Learning for Policy Optimization transforms the static training procedure into a dynamic process that co-evolves with the model's capabilities. Experiments show that CLPO achieves state-of-the-art performance across eight challenging mathematical and general reasoning benchmarks.
arXiv Detail & Related papers (2025-09-29T16:29:04Z) - Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning [37.20632079882874]
We introduce Difficulty Aware Certainty guided Exploration (DACE). It balances the exploration-exploitation trade-off based on the policy's success rate. Experiments on challenging mathematical reasoning benchmarks (AIME, MATH) show that DACE significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-08-29T08:57:54Z) - VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning [69.44871115752055]
We propose an advanced multimodal reasoning model trained via a novel Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity.
arXiv Detail & Related papers (2025-07-30T12:23:21Z) - Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z) - MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models [95.6332110724999]
Motivation-enhanced Reinforcement Finetuning (MeRF) is an intuitive yet effective method for enhancing reinforcement learning of Large Language Models (LLMs). MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to improve its responses with awareness of the optimization objective. Empirical evaluations on the Knights and Knaves (K&K) logic puzzle reasoning benchmark demonstrate that MeRF achieves substantial performance gains over baselines.
arXiv Detail & Related papers (2025-06-23T10:37:57Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation [5.793561443238794]
We propose two novel strategies to enhance the capability of large language models to solve complex problems. First, Adaptive Difficulty Curriculum Learning (ADCL) is a novel curriculum learning strategy that tackles the Difficulty Shift phenomenon. Second, Expert-Guided Self-Reformulation (EGSR) is a novel reinforcement learning strategy that bridges the gap between imitation learning and pure exploration.
arXiv Detail & Related papers (2025-05-13T09:10:48Z) - DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning [27.336254612018404]
Reinforcement learning from expert demonstrations has long remained a challenging research problem. Existing state-of-the-art methods using behavioral cloning plus further RL training often suffer from poor generalization, low sample efficiency, and poor model interpretability. We propose a novel strategy-based reinforcement learning framework integrated with large language models (LLMs) to overcome these limitations.
arXiv Detail & Related papers (2025-05-06T05:53:09Z) - How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study [16.441081996257576]
This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning strategies can substantially improve reasoning performance. We show that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. We will open-source our datasets on GitHub and Hugging Face.
arXiv Detail & Related papers (2025-04-01T14:18:38Z) - LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z) - On the Diminishing Returns of Complex Robust RAG Training in the Era of Powerful LLMs [85.688901949146]
We investigate the question: does the benefit of complex robust training methods diminish as language models become more powerful? Our analysis reveals a consistent trend: the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes.
arXiv Detail & Related papers (2025-02-17T03:34:31Z) - Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs [12.572869123617783]
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks.
PbRL presents a pioneering framework that capitalizes on human preferences as pivotal reward signals.
We propose an LLM-enabled automatic preference generation framework named LLM4PG.
arXiv Detail & Related papers (2024-06-28T04:21:24Z) - Enhancing Q-Learning with Large Language Model Heuristics [0.0]
Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations.
We propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning.
arXiv Detail & Related papers (2024-05-06T10:42:28Z)