Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle
- URL: http://arxiv.org/abs/2509.16679v1
- Date: Sat, 20 Sep 2025 13:11:28 GMT
- Title: Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle
- Authors: Keliang Liu, Dingkang Yang, Ziyun Qian, Weijie Yin, Yuchi Wang, Hongsheng Li, Jun Liu, Peng Zhai, Yang Liu, Lihua Zhang
- Abstract summary: Reinforcement Learning (RL) has markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs). This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs.
- Score: 66.80133103857703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL-augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse the future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.
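To make the RLVR recipe the abstract emphasizes concrete, below is a minimal sketch: a programmatic verifier checks each sampled completion against a reference answer, and the resulting binary score serves as the RL reward. The `#### <answer>` extraction convention and all function names are illustrative assumptions, not the survey's code.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final numeric answer from a completion ending in '#### <ans>'."""
    match = re.search(r"####\s*(-?[\d.,]+)", completion)
    return match.group(1).replace(",", "") if match else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 iff the extracted answer matches the reference."""
    return 1.0 if extract_final_answer(completion) == gold_answer.strip() else 0.0

# Score a batch of sampled completions against one reference answer.
completions = [
    "Step 1: 3 * 4 = 12. Step 2: 12 + 5 = 17. #### 17",
    "The answer is probably 20. #### 20",
]
print([verifiable_reward(c, "17") for c in completions])  # [1.0, 0.0]
```

Because the reward is computed by a program rather than a learned model, it is cheap, deterministic, and far less gameable than a learned reward model, which is much of RLVR's appeal.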
Related papers
- Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
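As a rough, hedged illustration of how a reward might be derived from raw pre-training text in this spirit, the sketch below splits a document, lets the policy generate the continuation, and rewards agreement with the held-out segment. The split point and the unigram-overlap score are assumptions for the sketch, not the paper's actual method.

```python
def split_document(doc: str, boundary: int) -> tuple[str, str]:
    """Prefix becomes the policy's context; the rest is the held-out target."""
    return doc[:boundary], doc[boundary:]

def overlap_reward(generated: str, target: str) -> float:
    """Crude unigram-overlap score in [0, 1]; a real system would use a
    stronger semantic check or a generative reward model."""
    gen, tgt = set(generated.lower().split()), set(target.lower().split())
    return len(gen & tgt) / max(len(tgt), 1)

doc = "Gradient descent updates parameters in the direction that reduces loss."
context, target = split_document(doc, 39)  # split point chosen arbitrarily
candidate = "the direction that lowers the loss"  # imagined policy rollout
print(round(overlap_reward(candidate, target), 2))
```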
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
- A Survey of Reinforcement Learning for Large Reasoning Models [98.58081012669369]
A review of recent advances in Reinforcement Learning for reasoning with Large Reasoning Models (LRMs). Further scaling of RL for LRMs now faces challenges not only in computational resources but also in algorithm design, training data, and infrastructure.
arXiv Detail & Related papers (2025-09-10T17:59:43Z)
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR [92.51110344832178]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects.
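One concrete exploration mechanism that work in this area commonly examines is an entropy bonus on the policy-gradient objective, which keeps probability mass on alternative reasoning paths. In the sketch below, names, shapes, and the coefficient are chosen for the example rather than taken from the paper.

```python
import torch

def pg_loss_with_entropy(logits: torch.Tensor,
                         actions: torch.Tensor,
                         advantages: torch.Tensor,
                         entropy_coef: float = 0.01) -> torch.Tensor:
    """REINFORCE-style loss with an entropy bonus that rewards exploration."""
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    # Optimizers minimize, so we minimize -(expected return + beta * entropy).
    return -(chosen * advantages + entropy_coef * entropy).mean()

logits = torch.randn(4, 32000)           # 4 sampled steps over a 32k vocab
actions = torch.randint(0, 32000, (4,))  # tokens the policy actually emitted
advantages = torch.tensor([1.0, -0.5, 0.2, 0.0])
print(pg_loss_with_entropy(logits, actions, advantages))
```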
arXiv Detail & Related papers (2025-08-11T01:26:16Z)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities [62.05713042908654]
This paper provides a review of advances in Large Language Model (LLM) alignment through the lens of inverse reinforcement learning (IRL). We highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift.
arXiv Detail & Related papers (2025-07-17T14:22:24Z)
- A Technical Survey of Reinforcement Learning Techniques for Large Language Models [33.38582292895673]
Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs). RLHF remains dominant for alignment, and outcome-based RL such as RLVR significantly improves stepwise reasoning. Persistent challenges such as reward hacking, computational costs, and scalable feedback collection underscore the need for continued innovation.
arXiv Detail & Related papers (2025-07-05T19:13:00Z)
- Reinforcement Learning Enhanced LLMs: A Survey [45.57586245741664]
We make a systematic review of the most up-to-date state of knowledge on RL-enhanced large language models (LLMs). Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; and (3) review research on two widely used reward-model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF).
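At the core of both RLHF and RLAIF is a reward model trained on preference pairs (human-labeled in RLHF, AI-labeled in RLAIF), typically with a pairwise Bradley-Terry loss. A minimal sketch, with illustrative tensor values:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Scalar rewards a reward model might assign to preferred/rejected responses.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -0.5])
print(preference_loss(r_chosen, r_rejected))  # small when chosen >> rejected
```

The trained reward model then scores policy samples during PPO-style fine-tuning.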
arXiv Detail & Related papers (2024-12-05T16:10:42Z)
- Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods [18.771658054884693]
Large language models (LLMs) have emerged as a promising avenue to augment reinforcement learning (RL) in aspects such as multi-task learning, sample efficiency, and high-level task planning.
We propose a structured taxonomy to systematically categorize LLMs' functionalities in RL, including four roles: information processor, reward designer, decision-maker, and generator.
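For quick reference, the four roles can be written down as a small enum; the one-line glosses are our own paraphrases of the taxonomy, not quotations from the paper.

```python
from enum import Enum

class LLMRoleInRL(Enum):
    INFORMATION_PROCESSOR = "interprets or compresses raw observations"
    REWARD_DESIGNER = "writes or shapes the reward signal"
    DECISION_MAKER = "acts as, or guides, the policy"
    GENERATOR = "synthesizes plans, explanations, or synthetic data"

for role in LLMRoleInRL:
    print(f"{role.name}: {role.value}")
```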
arXiv Detail & Related papers (2024-03-30T08:28:08Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
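A hedged sketch of the underlying idea, treating the LLM's suggested action distribution as a prior that regularizes value-based action selection; the soft-greedy form below is an assumption, not LINVIT's exact update.

```python
import numpy as np

def regularized_policy(q_values: np.ndarray,
                       llm_log_prior: np.ndarray,
                       beta: float = 1.0) -> np.ndarray:
    """Softmax of Q/beta + log prior: the optimum of a KL-regularized step
    that trades learned value against staying close to the LLM prior."""
    scores = q_values / beta + llm_log_prior
    scores -= scores.max()  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

q = np.array([0.5, 0.4, 0.1])       # learned action values
prior = np.log([0.1, 0.8, 0.1])     # LLM's suggested action distribution
print(regularized_policy(q, prior, beta=0.5))  # prior pulls mass toward action 1
```

The prior reduces how much of the action space must be explored from scratch, which is consistent with the reported data-efficiency gains.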
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization [0.0]
We introduce a method to address goal misgeneralization in reinforcement learning (RL).
Goal misgeneralization occurs when an agent retains its capabilities out-of-distribution yet pursues a proxy goal rather than the intended one.
This study demonstrates how a Large Language Model can efficiently supervise RL agents.
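In sketch form, the supervision pattern reads as follows: an LLM judges whether behavior serves the intended goal rather than a correlated proxy, and its judgment is blended into the reward. Everything below, including `query_llm_judge`, is a hypothetical stand-in rather than the paper's implementation.

```python
def query_llm_judge(trajectory_summary: str, intended_goal: str) -> float:
    """Hypothetical LLM call returning a goal-consistency score in [0, 1].
    A real implementation would prompt an actual model; we fake it here."""
    return 1.0 if intended_goal.lower() in trajectory_summary.lower() else 0.0

def shaped_reward(env_reward: float, trajectory_summary: str,
                  intended_goal: str, weight: float = 0.5) -> float:
    """Blend the (possibly proxy-correlated) env reward with LLM feedback."""
    llm_score = query_llm_judge(trajectory_summary, intended_goal)
    return (1 - weight) * env_reward + weight * llm_score

print(shaped_reward(1.0, "agent reached the coin, not just the level end", "coin"))
```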
arXiv Detail & Related papers (2024-01-14T01:09:48Z)