Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
- URL: http://arxiv.org/abs/2509.22613v1
- Date: Fri, 26 Sep 2025 17:39:48 GMT
- Title: Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
- Authors: Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen
- Abstract summary: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs). In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration.
- Score: 52.38531288378491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
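To make the abstract's comparison concrete, here is a minimal sketch of the graph-based abstraction it describes: a toy DAG with two equally correct paths, trained with REINFORCE (policy gradient) and with tabular Q-learning. This is not the paper's code; the graph, hyperparameters, and update rules are illustrative assumptions. It illustrates diversity collapse under PG versus tie-preserving Q-values under Q-learning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DAG: node 0 branches to nodes 1 and 2; both lead to the goal, node 3.
# Both actions at node 0 are correct, so an ideal learner keeps both.
EDGES = {0: [1, 2], 1: [3], 2: [3]}
START, GOAL = 0, 3

# --- REINFORCE (policy gradient): reward 1 for reaching the goal ---
logits = {n: np.zeros(len(nbrs)) for n, nbrs in EDGES.items()}
LR = 0.5
for _ in range(2000):
    node = START
    while node != GOAL:
        nbrs = EDGES[node]
        p = np.exp(logits[node]); p /= p.sum()
        a = rng.choice(len(nbrs), p=p)
        # Gradient of log pi(a|s) scaled by reward (= 1). The update is
        # zero in expectation, but its noise shrinks near the simplex
        # corners, so the policy drifts to one branch and is absorbed.
        grad = -p
        grad[a] += 1.0
        logits[node] += LR * grad
        node = nbrs[a]

p0 = np.exp(logits[START]); p0 /= p0.sum()
print("PG policy at start node:", p0)  # typically near-deterministic

# --- Tabular Q-learning with off-policy (uniform) exploration ---
Q = {n: np.zeros(len(nbrs)) for n, nbrs in EDGES.items()}
ALPHA, GAMMA = 0.5, 0.9
for _ in range(2000):
    node = START
    while node != GOAL:
        nbrs = EDGES[node]
        a = rng.integers(len(nbrs))  # behavior policy != learned policy
        nxt = nbrs[a]
        r = 1.0 if nxt == GOAL else 0.0
        bootstrap = 0.0 if nxt == GOAL else GAMMA * Q[nxt].max()
        Q[node][a] += ALPHA * (r + bootstrap - Q[node][a])
        node = nxt

print("Q at start node:", Q[START])  # both correct branches converge to
                                     # GAMMA, so ties preserve diversity
```

Because both branches end at equal Q-values, a policy that samples uniformly among maximizing actions keeps both correct plans, whereas the PG policy concentrates on one even though both always earn full reward.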
Related papers
- Diversity or Precision? A Deep Dive into Next Token Prediction [19.30494719444709]
We study how the pre-trained token-output distribution shapes the exploration potential for subsequent reinforcement learning. We find that imposing a precision-oriented gradient prior yields a superior exploration space for RL.
arXiv Detail & Related papers (2025-12-28T14:53:24Z)
- Scaling Reinforcement Learning for Content Moderation with Large Language Models [16.516137166093696]
We present a comprehensive empirical investigation of scaling reinforcement learning for content classification. We show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning.
arXiv Detail & Related papers (2025-12-23T05:27:16Z)
- Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
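As background for the variational reformulation in the entry above: treating the reasoning chain as a latent variable z, the generic amortized-inference objective maximizes an evidence lower bound. This is the standard ELBO sketch; the paper's exact objective may differ.

```latex
% Generic ELBO for latent chain-of-thought: z is the latent reasoning
% chain, x the input, y the answer; q_phi is the amortized posterior.
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid x, z)\right]
  \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x, y)\,\big\|\,p_\theta(z \mid x)\right)
```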
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
- Outcome-based Exploration for LLM Reasoning [18.33816564983908]
Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models. We show that RL can reduce effective diversity even on the training set relative to the base model. We propose outcome-based exploration, which assigns exploration bonuses according to final outcomes.
arXiv Detail & Related papers (2025-09-08T17:52:56Z)
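A minimal sketch of the outcome-based bonus idea in the entry above. This is illustrative only: the count-based form, the BONUS_SCALE value, and the function names are assumptions, not the paper's implementation. The key point is that novelty is measured on final answers rather than on token sequences.

```python
from collections import defaultdict
import math

outcome_counts = defaultdict(int)  # how often each final answer appeared
BONUS_SCALE = 0.5                  # illustrative hyperparameter

def shaped_reward(final_answer: str, task_reward: float) -> float:
    """Add an exploration bonus that decays with the outcome's visit count."""
    outcome_counts[final_answer] += 1
    bonus = BONUS_SCALE / math.sqrt(outcome_counts[final_answer])
    return task_reward + bonus

# Usage: the first time an answer appears, it earns the largest bonus.
print(shaped_reward("42", task_reward=1.0))  # 1.0 + 0.5/sqrt(1) = 1.5
print(shaped_reward("42", task_reward=1.0))  # 1.0 + 0.5/sqrt(2) ~ 1.354
```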
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
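The prolonged-training recipe above names KL regularization against a reference policy and periodic reference resets. A PyTorch-style sketch of those two pieces follows; the function names, kl_coef value, and reset_every interval are illustrative assumptions rather than the paper's settings.

```python
import copy
import torch

def pg_loss_with_kl(logp, ref_logp, advantages, kl_coef=0.05):
    """Policy-gradient surrogate with a KL penalty toward the reference.
    logp, ref_logp: log-probs of the sampled tokens under each policy."""
    kl = logp - ref_logp                       # MC estimate of KL per token
    return -(advantages * logp).mean() + kl_coef * kl.mean()

def maybe_reset_reference(step, policy, ref_policy, reset_every=10_000):
    """Periodically snapshot the current policy as the new KL anchor,
    so regularization tracks recent progress instead of pinning the
    model to its starting point."""
    if step % reset_every == 0:
        ref_policy.load_state_dict(copy.deepcopy(policy.state_dict()))
```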
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Advances in Preference-based Reinforcement Learning: A Review [1.474723404975345]
Preference-based reinforcement learning (PbRL) addresses the difficulty of specifying numeric rewards by instead utilizing human preferences from experts as feedback.
We present a unified PbRL framework that encompasses the newly emerging approaches improving the scalability and efficiency of PbRL.
arXiv Detail & Related papers (2024-08-21T18:57:12Z)
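As background for the PbRL review above: the most common reward-learning objective in PbRL fits a reward model to pairwise preferences with a Bradley-Terry likelihood. This is a sketch of that standard technique, not necessarily the survey's notation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_preferred, r_rejected):
    """Negative log-likelihood that the preferred trajectory wins.
    r_preferred, r_rejected: summed predicted rewards per trajectory."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Usage with dummy per-trajectory reward sums: the loss is lower when
# the reward model scores preferred trajectories above rejected ones.
r_a = torch.tensor([2.0, 0.5])   # trajectories labeled as preferred
r_b = torch.tensor([1.0, 1.5])   # trajectories labeled as rejected
print(bradley_terry_loss(r_a, r_b))
```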
- Exploiting Estimation Bias in Clipped Double Q-Learning for Continuous Control Reinforcement Learning Tasks [5.968716050740402]
This paper focuses on addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks.
We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent.
Most state-of-the-art deep RL algorithms can be equipped with the BE mechanism without degrading performance or increasing computational complexity.
arXiv Detail & Related papers (2024-02-14T10:44:03Z)
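For the clipped double Q-learning entry above: the underlying TD3-style target takes the pessimistic minimum of two critics, which produces the deliberate underestimation bias that BE-style methods then select among. A minimal sketch of that standard target follows; the BE selection logic itself is not shown.

```python
import torch

def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target using the pessimistic min over two target critics."""
    q_min = torch.min(q1_next, q2_next)        # clipped: underestimates
    return reward + gamma * (1.0 - done) * q_min

# Usage with dummy batch values:
target = clipped_double_q_target(
    reward=torch.tensor([1.0]), done=torch.tensor([0.0]),
    q1_next=torch.tensor([5.0]), q2_next=torch.tensor([4.0]))
print(target)  # 1.0 + 0.99 * 4.0 = 4.96
```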
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pairwise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward functions.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMControl and Meta-World.
It shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)