On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- URL: http://arxiv.org/abs/2502.06773v1
- Date: Mon, 10 Feb 2025 18:52:04 GMT
- Title: On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- Authors: Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, Sivakanth Gopi, Baolin Peng, Beibin Li, Janardhan Kulkarni, Huseyin A. Inan,
- Abstract summary: We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP)
RLSP involves three steps: supervised fine-tuning with human or synthetic demonstrations of the reasoning process, using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and RL training with an outcome verifier to ensure correctness while preventing reward hacking.
Empirical studies in the math domain show that RLSP improves reasoning.
- Score: 34.32871896067864
- License:
- Abstract: Recent AI advancements, such as OpenAI's new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality outputs. We aim to uncover the algorithmic framework for training LRMs. Methods like self-consistency, PRM, and AlphaZero suggest reasoning as guided search. We ask: what is the simplest, most scalable way to enable search in LLMs? We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple exploration and correctness signals during PPO training, carefully balancing them to improve performance and efficiency. Empirical studies in the math domain show that RLSP improves reasoning. On the Llama-3.1-8B-Instruct model, RLSP can boost performance by 23% in MATH-500 test set; On AIME 2024 math problems, Qwen2.5-32B-Instruct improved by 10% due to RLSP. However, a more important finding of this work is that the models trained using RLSP, even with the simplest exploration reward that encourages the model to take more intermediate steps, showed several emergent behaviors such as backtracking, exploration of ideas, and verification. These findings demonstrate that RLSP framework might be enough to enable emergence of complex reasoning abilities in LLMs when scaled. Lastly, we propose a theory as to why RLSP search strategy is more suitable for LLMs inspired by a remarkable result that says CoT provably increases computational power of LLMs, which grows as the number of steps in CoT \cite{li2024chain,merrill2023expresssive}.
Related papers
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.
Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.
We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.
We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z) - Reinforcement Learning Enhanced LLMs: A Survey [45.57586245741664]
We make a systematic review of the most up-to-date state of knowledge on RL-enhanced language models.
We detail the basics of RL; introduce popular RL-enhanced LLMs; review researches on two widely-used reward model-based RL techniques.
We will also point out current challenges and deficiencies of existing methods and suggest some avenues for further improvements.
arXiv Detail & Related papers (2024-12-05T16:10:42Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (textbfRLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z) - ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - The RL/LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement
Learning and Large Language Models [2.5721733711031978]
We review research studies that combine Reinforcement Learning (RL) and Large Language Models (LLMs)
We propose a novel taxonomy of three main classes based on the way that the two model types interact with each other.
arXiv Detail & Related papers (2024-02-02T20:01:15Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL)
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
arXiv Detail & Related papers (2022-07-29T17:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.