Teaching Large Language Models to Reason with Reinforcement Learning
- URL: http://arxiv.org/abs/2403.04642v1
- Date: Thu, 7 Mar 2024 16:36:29 GMT
- Title: Teaching Large Language Models to Reason with Reinforcement Learning
- Authors: Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos
Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar
Sukhbaatar, Roberta Raileanu
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
- Score: 38.17625148525193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a
dominant approach for aligning LLM outputs with human preferences. Inspired by
the success of RLHF, we study the performance of multiple algorithms that learn
from feedback (Expert Iteration, Proximal Policy Optimization (PPO),
Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate
both sparse and dense rewards provided to the LLM both heuristically and via a
learned reward model. We additionally start from multiple model sizes and
initializations both with and without supervised fine-tuning (SFT)
data. Overall, we find all algorithms perform comparably, with Expert Iteration
performing best in most cases. Surprisingly, we find the sample complexity of
Expert Iteration is similar to that of PPO, requiring at most on the order of
$10^6$ samples to converge from a pretrained checkpoint. We investigate why
this is the case, concluding that during RL training models fail to explore
significantly beyond solutions already produced by SFT models. Additionally, we
discuss a trade-off between maj@1 and pass@96 metric performance during SFT
training, and how, in contrast, RL training improves both simultaneously. We then
conclude by discussing the implications of our findings for RLHF and the future
role of RL in LLM fine-tuning.
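To make the maj@1 versus pass@96 discussion concrete, the snippet below is a minimal sketch, not code from the paper: the function names and toy numbers are illustrative. It computes the two metric families for a single problem, using the standard unbiased pass@k estimator and majority voting over sampled final answers.

```python
from collections import Counter
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations is correct, given that c of
    the n generations are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def maj_at_k(sampled_answers: list[str], reference: str) -> float:
    """maj@k: 1.0 if the majority-vote answer over the sampled answers
    matches the reference; with a single greedy sample this is maj@1."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return float(majority_answer == reference)

# Toy illustration: 96 samples, 30 of which reach the correct final answer.
answers = ["42"] * 30 + ["41"] * 40 + ["7"] * 26
print(maj_at_k(answers, "42"))       # 0.0 -- the vote settles on "41"
print(pass_at_k(n=96, c=30, k=96))   # 1.0 -- at least one sample is correct
print(pass_at_k(n=96, c=30, k=1))    # ~0.31 -- expected single-sample accuracy
```

Because the two metrics can move independently (e.g., saturated pass@96 with a wrong majority vote), tracking both is what exposes the SFT trade-off described above.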
Related papers
- LMGT: Optimizing Exploration-Exploitation Balance in Reinforcement Learning through Language Model Guided Trade-offs [27.014415210732103]
We introduce Language Model Guided Trade-offs (i.e., LMGT), a novel, sample-efficient framework for Reinforcement Learning.
arXiv Detail & Related papers (2024-09-07T07:40:43Z)
- ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback [13.154512864498912]
We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT).
First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT).
Second, we ask the Teacher to correct the wrong reasoning after the RL stage. With the correction feedback, we stabilize the RL fine-tuned model through SFT.
arXiv Detail & Related papers (2024-06-25T07:20:11Z)
- Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: amplifying the reward signal learned during alignment training.
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
- Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language.
As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values.
This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO).
By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text.
This ability could potentially be used to predict plausible solutions in sequential decision-making tasks pertaining to pattern completion.
We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods (a minimal sketch of the DPO objective appears after this list).
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons [79.98542868281473]
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions (a sketch of the standard pairwise MLE appears after this list).
arXiv Detail & Related papers (2023-01-26T18:07:21Z)
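As referenced in the Direct Preference Optimization entry above, DPO replaces explicit reward modeling with a logistic loss on the difference of policy-versus-reference log-likelihood ratios for chosen and rejected responses. The snippet below is a minimal NumPy sketch of that objective; the variable names, beta value, and toy numbers are illustrative rather than taken from the paper.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Batched DPO objective. Each argument is an array of summed token
    log-probabilities of the chosen/rejected response under the trained
    policy or the frozen reference model; beta scales the implicit reward."""
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return np.mean(np.logaddexp(0.0, -margin))  # mean of -log sigmoid(margin)

# Toy batch of two preference pairs (summed log-probs under each model).
loss = dpo_loss(np.array([-12.0, -9.5]), np.array([-13.0, -9.0]),
                np.array([-12.5, -9.4]), np.array([-12.8, -9.2]))
print(loss)  # scalar loss; its gradient w.r.t. the policy log-probs drives training
```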
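The pairwise-comparison entry above builds on the same preference model: a reward function fit by maximum likelihood under a Bradley-Terry model of pairwise choices. That paper's contribution concerns when plain MLE suffices and when a pessimistic variant is needed, which the sketch below does not cover; it only illustrates the standard pairwise MLE with a linear reward on made-up features, so all names and numbers are hypothetical.

```python
import numpy as np

def bradley_terry_nll(theta, feats_preferred, feats_rejected):
    """Negative log-likelihood of pairwise preferences under a Bradley-Terry
    model with a linear reward r(x, y) = theta . phi(x, y)."""
    margin = (feats_preferred - feats_rejected) @ theta
    return np.mean(np.logaddexp(0.0, -margin))  # mean of -log sigmoid(margin)

# Simulate preferences from a hidden reward, then recover it by plain MLE.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])
phi_a, phi_b = rng.normal(size=(50, 2)), rng.normal(size=(50, 2))
prefer_a = (phi_a - phi_b) @ true_theta > 0        # simulated preference labels
fp = np.where(prefer_a[:, None], phi_a, phi_b)     # features of preferred item
fr = np.where(prefer_a[:, None], phi_b, phi_a)     # features of rejected item

theta = np.zeros(2)
for _ in range(500):                               # plain gradient descent on the NLL
    margin = (fp - fr) @ theta
    grad = -((fp - fr) * (1.0 / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
    theta -= 0.5 * grad
print(theta, bradley_terry_nll(theta, fp, fr))     # theta aligns with true_theta's direction
```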