RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
- URL: http://arxiv.org/abs/2409.13221v2
- Date: Wed, 25 Sep 2024 22:28:06 GMT
- Title: RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
- Authors: Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, Xin Jin
- Abstract summary: Existing RLHF systems suffer from low GPU utilization in production deployments.
RLHFuse breaks the traditional view of the RLHF workflow as a composition of individual tasks.
RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems.
- Score: 10.165579735221092
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preferences. The RLHF workflow typically involves several models and tasks organized in a series of distinct stages. Existing RLHF training systems treat each task as the smallest execution unit, thus overlooking opportunities for subtask-level optimization. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of the RLHF workflow as a composition of individual tasks: it splits each task into finer-grained subtasks and performs stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for the generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion that mitigates the generation bottleneck dominated by long-tailed samples. Second, for the training tasks, RLHFuse breaks them into subtasks of micro-batches. Leveraging the intuition that one pipeline's execution can fill the bubbles of another, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored to each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs, and the results show that RLHFuse increases training throughput by up to 3.7x compared to existing state-of-the-art systems.
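The two bottlenecks the abstract names can be illustrated with a toy calculation. This is a sketch with made-up numbers, not the RLHFuse scheduler: `gen_times`, `inf_time`, `p`, and `m` are hypothetical values chosen only to show the effect.

```python
# Toy sketch (not the RLHFuse implementation; all numbers are made up).

# 1) Inter-stage fusion: with task-level execution, the inference
#    stage waits for the slowest (long-tailed) generation sample;
#    with sample-level subtasks, each finished sample enters
#    inference immediately.
gen_times = [1, 1, 1, 1, 10]   # hypothetical per-sample generation times
inf_time = 2                   # hypothetical per-sample inference cost

# Task-level: inference starts only after ALL generation finishes.
task_level = max(gen_times) + inf_time * len(gen_times)

# Sample-level fusion: one inference worker, processing samples
# in the order their generation completes.
finish = 0
for t in sorted(gen_times):
    finish = max(finish, t) + inf_time
fused = finish

# 2) Pipeline bubbles: a standard 1F1B schedule with p stages and
#    m micro-batches idles for a (p - 1) / (m + p - 1) fraction of
#    the time; intra-stage fusion fills that idle time with another
#    training task's micro-batches.
p, m = 8, 16
bubble_fraction = (p - 1) / (m + p - 1)

print(task_level, fused, round(bubble_fraction, 3))
```

With these numbers, task-level execution takes 20 time units while sample-level fusion finishes in 12, and the unfused pipeline idles roughly 30% of the time; the point is only that the gains come from the long tail and the bubble fraction, not from any particular constants.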
Related papers
- MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions [46.608747360764035]
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences.
We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process.
We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
arXiv Detail & Related papers (2024-10-03T17:55:13Z)
- HybridFlow: A Flexible and Efficient RLHF Framework [13.80577212781375]
Reinforcement Learning from Human Feedback is widely used in Large Language Model (LLM) alignment.
Traditional RL can be modeled as a dataflow, where each node represents the computation of a neural network (NN).
We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow.
arXiv Detail & Related papers (2024-09-28T06:20:03Z)
- ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation [12.321332446941378]
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications.
We propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster.
We introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training.
arXiv Detail & Related papers (2024-06-20T08:04:07Z)
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z)
- Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy [47.327200425168314]
Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure Large Language Models (LLMs) align with human values.
We introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs.
Our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
arXiv Detail & Related papers (2024-03-07T07:31:00Z)
- ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation [43.506732624371786]
We introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
arXiv Detail & Related papers (2023-08-04T09:35:45Z)
- Hypernetworks in Meta-Reinforcement Learning [47.25270748922176]
Multi-task reinforcement learning (RL) and meta-RL aim to improve sample efficiency by generalizing over a distribution of related tasks.
State-of-the-art methods often fail to outperform a degenerate solution that simply learns each task separately.
Hypernetworks are a promising path forward since they replicate the separate policies of the degenerate solution and are applicable to meta-RL.
arXiv Detail & Related papers (2022-10-20T15:34:52Z)
- DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV [65.07776277630228]
We propose a double-level deep reinforcement learning (DL-DRL) approach based on a divide-and-conquer framework (DCF).
Particularly, we design an encoder-decoder structured policy network in our upper-level DRL model to allocate the tasks to different UAVs.
We also exploit another attention-based policy network in our lower-level DRL model to construct the route for each UAV, with the objective of maximizing the number of executed tasks.
arXiv Detail & Related papers (2022-08-04T04:35:53Z)
- Decoupling Representation Learning from Reinforcement Learning [89.82834016009461]
We introduce an unsupervised learning task called Augmented Temporal Contrast (ATC).
ATC trains a convolutional encoder to associate pairs of observations separated by a short time difference.
In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL.
arXiv Detail & Related papers (2020-09-14T19:11:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.