ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- URL: http://arxiv.org/abs/2310.10505v4
- Date: Thu, 16 May 2024 02:22:23 GMT
- Title: ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- Authors: Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs).
We present ReMax, which leverages three properties of RLHF: fast simulation, deterministic transitions, and trajectory-level rewards.
It is simpler to implement, eliminates more than four hyper-parameters in PPO, reduces GPU memory usage, and shortens training time.
Applying ReMax to a Mistral-7B model resulted in a 94.78% win rate on the AlpacaEval leaderboard and a 7.739 score on MT-bench, setting a new SOTA for open-source 7B models.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs), typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general reinforcement learning tasks, it is overly sophisticated for LLMs, leading to laborious hyper-parameter tuning and significant computational burdens. To make RLHF efficient, we present ReMax, which leverages three properties of RLHF: fast simulation, deterministic transitions, and trajectory-level rewards. These properties are not exploited in PPO, making it less suitable for RLHF. Building on the renowned REINFORCE algorithm, ReMax does not require training an additional value model as in PPO and is further enhanced with a new variance reduction technique. ReMax offers several benefits over PPO: it is simpler to implement, eliminates more than four hyper-parameters in PPO, reduces GPU memory usage, and shortens training time. ReMax saves about 46% of GPU memory compared with PPO when training a 7B model and enables training on A800-80GB GPUs without the memory-saving offloading technique needed by PPO. Applying ReMax to a Mistral-7B model resulted in a 94.78% win rate on the AlpacaEval leaderboard and a 7.739 score on MT-bench, setting a new SOTA for open-source 7B models. These results show the effectiveness of ReMax while addressing the limitations of PPO in LLMs.
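The abstract describes ReMax as a REINFORCE-style estimator that uses the trajectory-level reward directly and, instead of PPO's learned value model, subtracts a baseline given by the reward of a greedily decoded response. The snippet below is a minimal, self-contained sketch of that loss under the assumption that the summed sequence log-probabilities and scalar rewards have already been computed; it is an illustration of the idea, not the authors' released implementation.

```python
import torch

def remax_loss(sampled_logprob: torch.Tensor,
               sampled_reward: torch.Tensor,
               greedy_reward: torch.Tensor) -> torch.Tensor:
    # REINFORCE with a baseline: the reward of the greedily decoded response
    # is subtracted from the sampled response's reward (ReMax's variance
    # reduction); no value model is trained.
    advantage = (sampled_reward - greedy_reward).detach()
    # sampled_logprob is the sum of token log-probs of the sampled response
    # under the current policy, so the gradient is advantage * grad log-prob.
    return -(advantage * sampled_logprob).mean()

# Toy usage with made-up numbers standing in for a policy forward pass
# and a reward-model call.
logp = torch.tensor([-12.3, -9.8], requires_grad=True)
loss = remax_loss(logp,
                  sampled_reward=torch.tensor([0.7, 0.2]),
                  greedy_reward=torch.tensor([0.5, 0.4]))
loss.backward()
```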
Related papers
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach that computes unbiased Monte Carlo-based value estimates for credit assignment.
We show that VinePPO consistently outperforms PPO and RL-free baselines across the MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function [14.7365465149829]
We propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO and KTO.
With this novel mapping between a reward model and an optimal policy, UNA can outperform RLHF/PPO while simplifying, stabilizing, speeding up, and reducing the memory burden of the RL fine-tuning process.
arXiv Detail & Related papers (2024-08-27T18:04:07Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models with human preferences.
To alleviate some of the computational burden of fine-tuning, parameter-efficient methods such as LoRA were introduced.
We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering.
arXiv Detail & Related papers (2024-03-15T21:43:46Z) - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs [29.505270680223003]
AI alignment in the form of Reinforcement Learning from Human Feedback is increasingly treated as a crucial ingredient for high-performance large language models.
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF.
We show that many components of PPO are unnecessary in an RLHF context and that simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT.
arXiv Detail & Related papers (2024-02-22T17:52:34Z) - Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF [47.960563851948514]
We investigate an efficient implementation of RLHF using low-rank adaptation (LoRA).
Our implementation achieves better performance than the publicly released AlpacaFarm checkpoint obtained with full model fine-tuning.
We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
arXiv Detail & Related papers (2023-09-16T17:31:36Z) - Efficient RLHF: Reducing the Memory Usage of PPO [61.45357428856269]
We present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO.
We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training.
Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
arXiv Detail & Related papers (2023-09-01T22:57:20Z) - Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods (a minimal sketch of the DPO objective follows this list).
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
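As a companion to the last entry above, here is a minimal sketch of the DPO objective it describes: each response's implicit reward is beta times the log-probability ratio between the policy and a frozen reference model, and the loss is a logistic loss on the reward margin between the chosen and rejected responses. The summed sequence log-probabilities and the beta value are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are beta * (policy log-prob - reference log-prob);
    # the reference terms carry no gradient because the reference model is frozen.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Negative log-sigmoid (logistic) loss on the preference margin.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up summed sequence log-probabilities.
loss = dpo_loss(policy_logp_chosen=torch.tensor([-10.0]),
                policy_logp_rejected=torch.tensor([-12.5]),
                ref_logp_chosen=torch.tensor([-11.0]),
                ref_logp_rejected=torch.tensor([-11.5]))
```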