Improving RL Exploration for LLM Reasoning through Retrospective Replay
- URL: http://arxiv.org/abs/2504.14363v1
- Date: Sat, 19 Apr 2025 17:40:04 GMT
- Title: Improving RL Exploration for LLM Reasoning through Retrospective Replay
- Authors: Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: We propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration.
- Score: 45.00643118030677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model's ability to revisit and re-explore these ideas later. Consequently, although the LLM's capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.
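The abstract describes the replay mechanism only at a high level, so the following Python sketch is one plausible reading rather than the paper's actual implementation: promising partial solutions found early in training are buffered, and later rollouts occasionally resume from them so the stronger policy can re-explore ideas that the policy gradient had suppressed. All names and thresholds (`PromisingState`, `value_threshold`, `replay_prob`) are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of a retrospective replay buffer, assuming promising
# partial solutions are identified by a heuristic score such as a critic's
# value estimate. Names and thresholds are illustrative, not the paper's.
import random
from dataclasses import dataclass, field


@dataclass
class PromisingState:
    prompt: str   # the original problem
    prefix: str   # partial solution judged promising at storage time
    value: float  # heuristic score (e.g., critic value) when stored


@dataclass
class RetrospectiveReplayBuffer:
    capacity: int = 1000
    states: list = field(default_factory=list)

    def add(self, state: PromisingState, value_threshold: float = 0.5) -> None:
        # Keep only states whose estimated value suggests a promising idea.
        if state.value >= value_threshold:
            self.states.append(state)
            # Evict the lowest-value state when over capacity.
            if len(self.states) > self.capacity:
                self.states.sort(key=lambda s: s.value)
                self.states.pop(0)

    def maybe_sample(self, replay_prob: float = 0.3):
        # With some probability, return a stored state to resume from;
        # otherwise signal the caller to start a fresh rollout.
        if self.states and random.random() < replay_prob:
            return random.choice(self.states)
        return None


def rollout_start(buffer: RetrospectiveReplayBuffer, prompt: str) -> str:
    # The policy either continues a replayed promising prefix or starts anew.
    replayed = buffer.maybe_sample()
    if replayed is not None:
        return replayed.prompt + replayed.prefix
    return prompt
```

In a real PPO-style training loop, `rollout_start` would replace prompt-only initialization for some fraction of rollouts; how RRL actually identifies, stores, and schedules promising states is defined in the paper itself.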
Related papers
- Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models [53.4530106173067]
Large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks.
RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively.
This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge (a generic sketch of such an episodic novelty bonus appears after this list).
arXiv Detail & Related papers (2025-04-03T04:46:17Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs [58.18140409409302]
Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL). Applying RL in broader domains like chatbots and content generation presents unique challenges. We show a case study of reproducing existing reward model ensemble research using embedding-based reward models.
arXiv Detail & Related papers (2025-02-04T19:37:35Z)
- Reinforcement Learning Enhanced LLMs: A Survey [45.57586245741664]
We present a systematic review of the most up-to-date state of knowledge on RL-enhanced large language models (LLMs). Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; and (3) review research on two widely used reward-model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF).
arXiv Detail & Related papers (2024-12-05T16:10:42Z)
- Towards Sample-Efficiency and Generalization of Transfer and Inverse Reinforcement Learning: A Comprehensive Literature Review [50.67937325077047]
This paper comprehensively reviews how transfer and inverse reinforcement learning (T-IRL) can improve the sample efficiency and generalization of RL algorithms.
Our findings indicate that most recent works address these challenges through human-in-the-loop and sim-to-real strategies.
Under the IRL framework, researchers have recently prioritized training schemes that require few experience transitions, as well as extensions of such frameworks to multi-agent and multi-intention problems.
arXiv Detail & Related papers (2024-11-15T15:18:57Z)
- Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs [12.572869123617783]
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks.
Preference-based RL (PbRL) is a pioneering framework that uses human preferences as the reward signal.
We propose an LLM-enabled automatic preference generation framework named LLM4PG.
arXiv Detail & Related papers (2024-06-28T04:21:24Z)
- Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Planning Case Study [1.3597551064547502]
We employ a teacher-student learning framework to tackle problems of Large Language Models (LLMs) and reinforcement learning (RL) models. Within this framework, the LLM acts as a teacher, while the RL model acts as a student. We propose a practical algorithm to address the problem and conduct empirical experiments to evaluate the effectiveness of our method.
arXiv Detail & Related papers (2024-01-12T14:35:57Z)
- Evolutionary Reinforcement Learning: A Survey [31.112066295496003]
Reinforcement learning (RL) is a machine learning approach that trains agents to maximize cumulative rewards through interactions with environments.
This article presents a comprehensive survey of state-of-the-art methods for integrating evolutionary computation (EC) into RL, referred to as evolutionary reinforcement learning (EvoRL).
arXiv Detail & Related papers (2023-03-07T01:38:42Z)
- Ensemble Reinforcement Learning: A Survey [43.17635633600716]
Reinforcement Learning (RL) has emerged as a highly effective technique for addressing various scientific and applied problems.
In response, ensemble reinforcement learning (ERL), a promising approach that combines the benefits of both RL and ensemble learning (EL), has gained widespread popularity.
ERL leverages multiple models or training algorithms to comprehensively explore the problem space and possesses strong generalization capabilities.
arXiv Detail & Related papers (2023-03-05T09:26:44Z)
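As noted in the first entry above, the "Reasoning Under 1 Billion" paper mentions an episodic-memory intrinsic motivation approach without detailing its mechanism. The sketch below shows a generic episodic novelty bonus of the kind commonly used for exploration, purely as an illustration under assumptions: `EpisodicMemory`, `novelty_bonus`, and the 0.1 weighting are not taken from that paper.

```python
# A generic episodic novelty bonus, assumed for illustration: states far
# from everything already stored in an episodic buffer earn extra reward.
import numpy as np


class EpisodicMemory:
    def __init__(self, dim: int, capacity: int = 4096):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.capacity = capacity

    def novelty_bonus(self, emb: np.ndarray) -> float:
        # Bonus is the distance to the nearest stored embedding; novel
        # states (large distance) receive a larger intrinsic reward.
        if len(self.embeddings) == 0:
            bonus = 1.0
        else:
            dists = np.linalg.norm(self.embeddings - emb, axis=1)
            bonus = float(dists.min())
        self._add(emb)
        return bonus

    def _add(self, emb: np.ndarray) -> None:
        self.embeddings = np.vstack([self.embeddings, emb[None, :]])
        if len(self.embeddings) > self.capacity:
            self.embeddings = self.embeddings[1:]  # FIFO eviction


# Usage: total reward = task reward + beta * intrinsic novelty bonus.
memory = EpisodicMemory(dim=8)
state_embedding = np.random.rand(8).astype(np.float32)
reward = 0.0 + 0.1 * memory.novelty_bonus(state_embedding)
```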