An Adaptive Placement and Parallelism Framework for Accelerating RLHF
Training
- URL: http://arxiv.org/abs/2312.11819v2
- Date: Thu, 25 Jan 2024 02:46:06 GMT
- Title: An Adaptive Placement and Parallelism Framework for Accelerating RLHF
Training
- Authors: Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao,
Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
- Abstract summary: We propose an adaptive model placement framework that offers two flexible model placement strategies.
The Interleaving and Separation strategies achieve notable improvements of up to 11x compared to current SOTA approaches.
- Score: 12.191192247301853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, ChatGPT- and InstructGPT-like large language models (LLMs)
have made a significant impact in the AI world. Many works have attempted to
reproduce InstructGPT's complex training pipeline, namely Reinforcement Learning
with Human Feedback (RLHF). However, the mainstream distributed RLHF training
methods typically adopt a fixed model placement strategy, referred to as the
Flattening strategy. This strategy treats all four interdependent models
involved in RLHF as a single entity, distributing them across all devices and
applying parallelism techniques designed for a single model, regardless of the
different workloads inherent to each model. As a result, this strategy
exacerbates the generation bottlenecks in the RLHF training and degrades the
overall training efficiency. To address these issues, we propose an adaptive
model placement framework that offers two flexible model placement strategies.
The Interleaving strategy helps reduce memory redundancy and communication
costs of RLHF training by placing models without dependencies on exclusive
devices with careful orchestration. On the other hand, the Separation strategy
improves the throughput of model training by separating the training and
inference runtime of the RLHF pipeline with additional shadow models.
Furthermore, our framework provides a simple user interface and allows for the
agile allocation of models across devices in a fine-grained manner for various
training scenarios, involving models of varying sizes and devices of different
scales. Extensive experiments demonstrate that our Interleaving and Separation
strategies achieve notable improvements of up to 11x compared to current SOTA
approaches. The results highlight the effectiveness and
adaptability of our approaches in accelerating the training of distributed
RLHF.
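To make the placement strategies concrete, here is a minimal, illustrative Python sketch. It is not the paper's actual interface: the PlacementPlan class, the device groupings, and the particular model pairings are hypothetical, chosen only to show how Flattening, Interleaving, and Separation differ in where the four RLHF models (and the extra shadow models) are placed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four interdependent RLHF models discussed in the abstract.
RLHF_MODELS = ["actor", "reference", "critic", "reward"]

@dataclass
class PlacementPlan:
    """Maps each model to the GPU ranks it occupies (illustrative only)."""
    strategy: str
    assignment: Dict[str, List[int]] = field(default_factory=dict)

def flattening(world: List[int]) -> PlacementPlan:
    # Baseline: every model is distributed across every device.
    return PlacementPlan("flattening", {m: list(world) for m in RLHF_MODELS})

def interleaving(world: List[int]) -> PlacementPlan:
    # Models without data dependencies get exclusive device groups;
    # the pairing below is one possible choice, not the paper's.
    half = len(world) // 2
    left, right = world[:half], world[half:]
    return PlacementPlan("interleaving", {
        "actor": left, "critic": left,
        "reference": right, "reward": right,
    })

def separation(train_ranks: List[int], infer_ranks: List[int]) -> PlacementPlan:
    # Training models keep the training ranks; inference-only shadow copies
    # handle generation on separate ranks.
    plan = {m: list(train_ranks) for m in RLHF_MODELS}
    plan["actor_shadow"] = list(infer_ranks)
    plan["reference_shadow"] = list(infer_ranks)
    return PlacementPlan("separation", plan)

if __name__ == "__main__":
    gpus = list(range(8))
    for plan in (flattening(gpus), interleaving(gpus), separation(gpus[:6], gpus[6:])):
        print(plan.strategy, plan.assignment)
```

In a real system each device group would presumably also carry its own per-model parallelism configuration (e.g. tensor, pipeline, or ZeRO degrees); that is the fine-grained, per-device allocation the abstract refers to.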
Related papers
- A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
The proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
arXiv Detail & Related papers (2024-04-07T03:04:34Z)
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion (a rough sketch follows this entry).
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
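For intuition on the layer-wise idea in SALF above, here is a rough, hypothetical sketch (not the paper's actual algorithm; the function name and update format are made up): because backpropagation produces gradients from the output layer backwards, a straggler can still report the deeper layers it managed to update, and the server averages each layer over whichever clients provided it.

```python
import numpy as np

def layerwise_aggregate(global_layers, client_updates):
    """global_layers: list of np.ndarray weights, ordered input -> output.
    client_updates: list of dicts {layer_index: updated np.ndarray};
    a straggler simply omits the shallow layers it never reached."""
    aggregated = []
    for i, layer in enumerate(global_layers):
        contribs = [u[i] for u in client_updates if i in u]
        if contribs:
            aggregated.append(np.mean(contribs, axis=0))  # average clients that reached layer i
        else:
            aggregated.append(layer.copy())               # nobody reached it: keep old weights
    return aggregated

# Tiny example: three layers, one full update and one straggler that only reached the last layer.
layers = [np.zeros((2, 2)) for _ in range(3)]
full = {0: np.ones((2, 2)), 1: np.ones((2, 2)), 2: np.ones((2, 2))}
straggler = {2: 3 * np.ones((2, 2))}
print([float(l.mean()) for l in layerwise_aggregate(layers, [full, straggler])])  # [1.0, 1.0, 2.0]
```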
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to 20x when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing the communication and memory costs of large-scale distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Vertical Federated Learning over Cloud-RAN: Convergence Analysis and System Optimization [82.12796238714589]
We propose a novel cloud radio access network (Cloud-RAN) based vertical FL system to enable fast and accurate model aggregation.
We characterize the convergence behavior of the vertical FL algorithm considering both uplink and downlink transmissions.
We establish a system optimization framework by joint transceiver and fronthaul quantization design, for which successive convex approximation and alternate convex search based system optimization algorithms are developed.
arXiv Detail & Related papers (2023-05-04T09:26:03Z)
- Memory-adaptive Depth-wise Heterogenous Federated Learning [24.13198329419849]
We introduce a memory-adaptive depth-wise learning solution in FL called FeDepth, which adaptively decomposes the full model into blocks according to the memory budget of each client (a rough partitioning sketch follows this entry).
Our method outperforms state-of-the-art approaches, achieving 5% and more than 10% improvements in top-1 accuracy on CIFAR-10 and CIFAR-100, respectively.
arXiv Detail & Related papers (2023-03-08T20:52:57Z)
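As a rough illustration of decomposing a model into blocks under a per-client memory budget (hypothetical; FeDepth's actual partitioning and depth-wise training procedure are more involved, and the function below is made up for illustration), a greedy split might look like:

```python
def split_into_blocks(layer_costs, memory_budget):
    """Greedily group consecutive layers into blocks whose summed memory
    cost stays within a client's budget (illustrative only).

    layer_costs: list of per-layer memory costs (e.g. MB).
    memory_budget: maximum cost a single block may have.
    """
    blocks, current, used = [], [], 0
    for i, cost in enumerate(layer_costs):
        if current and used + cost > memory_budget:
            blocks.append(current)          # close the block and start a new one
            current, used = [], 0
        current.append(i)
        used += cost
    if current:
        blocks.append(current)
    return blocks

# A client with a 300 MB budget trains the same model in more, smaller blocks
# than a client with a 700 MB budget.
costs = [120, 180, 150, 200, 90, 160]
print(split_into_blocks(costs, 300))  # [[0, 1], [2], [3, 4], [5]]
print(split_into_blocks(costs, 700))  # [[0, 1, 2, 3], [4, 5]]
```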
- Tensor Decomposition based Personalized Federated Learning [12.420951968273574]
Federated learning (FL) is a new distributed machine learning framework that can achieve reliable collaborative training without collecting users' private data.
Due to FL's frequent communication and averaging-based aggregation strategy, it faces challenges in scaling to statistically diverse data and large-scale models.
We propose a personalized FL framework, named Tensor Decomposition based Personalized Federated learning (TDPFed), in which we design a novel tensorized local model with tensorized linear layers and convolutional layers to reduce the communication cost (a generic factorization sketch follows this entry).
arXiv Detail & Related papers (2022-08-27T08:09:14Z)
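For a rough sense of why factorized layers cut communication cost, here is a generic low-rank matrix factorization standing in for the tensor decomposition TDPFed actually applies; the class name and numbers are illustrative, not the paper's.

```python
import numpy as np

class FactorizedLinear:
    """Approximate a d_in x d_out weight matrix as U @ V with small rank r,
    so a client uploads r*(d_in + d_out) parameters instead of d_in*d_out."""
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(0.0, 0.02, size=(d_in, rank))
        self.V = rng.normal(0.0, 0.02, size=(rank, d_out))

    def forward(self, x):
        return x @ self.U @ self.V   # same output shape as a dense linear layer

    def num_params(self):
        return self.U.size + self.V.size

layer = FactorizedLinear(d_in=1024, d_out=1024, rank=16)
print(layer.num_params(), "parameters vs", 1024 * 1024, "for a dense layer")  # 32768 vs 1048576
```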
- Efficient Split-Mix Federated Learning for On-Demand and In-Situ Customization [107.72786199113183]
Federated learning (FL) provides a distributed learning framework for multiple participants to collaboratively learn without sharing raw data.
In this paper, we propose a novel Split-Mix FL strategy for heterogeneous participants that, once training is done, provides in-situ customization of model sizes and robustness.
arXiv Detail & Related papers (2022-03-18T04:58:34Z)
- FedHe: Heterogeneous Models and Communication-Efficient Federated Learning [0.0]
Federated learning (FL) is able to manage edge devices to cooperatively train a model while keeping the training data local and private.
We propose a novel FL method, called FedHe, inspired by knowledge distillation, which can train heterogeneous models and support asynchronous training processes.
arXiv Detail & Related papers (2021-10-19T12:18:37Z)
- Self-Progressing Robust Training [146.8337017922058]
Current robust training methods such as adversarial training explicitly use an "attack" to generate adversarial examples.
We propose a new framework called SPROUT, self-progressing robust training.
Our results shed new light on scalable, effective and attack-independent robust training methods.
arXiv Detail & Related papers (2020-12-22T00:45:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.