An Adaptive Placement and Parallelism Framework for Accelerating RLHF
Training
- URL: http://arxiv.org/abs/2312.11819v2
- Date: Thu, 25 Jan 2024 02:46:06 GMT
- Title: An Adaptive Placement and Parallelism Framework for Accelerating RLHF
Training
- Authors: Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao,
Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
- Abstract summary: We propose an adaptive model placement framework that offers two flexible model placement strategies.
The Interleaving and Separation strategies achieve notable improvements of up to 11x compared to current SOTA approaches.
- Score: 12.191192247301853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, ChatGPT- and InstructGPT-like large language models (LLMs)
have made a significant impact in the AI world. Many works have attempted to
reproduce InstructGPT's complex training pipeline, namely Reinforcement Learning
with Human Feedback (RLHF). However, the mainstream distributed RLHF training
methods typically adopt a fixed model placement strategy, referred to as the
Flattening strategy. This strategy treats all four interdependent models
involved in RLHF as a single entity, distributing them across all devices and
applying parallelism techniques designed for a single model, regardless of the
different workloads inherent to each model. As a result, this strategy
exacerbates the generation bottlenecks in the RLHF training and degrades the
overall training efficiency. To address these issues, we propose an adaptive
model placement framework that offers two flexible model placement strategies.
The Interleaving strategy helps reduce memory redundancy and communication
costs of RLHF training by placing models without dependencies on exclusive
devices with careful orchestration. On the other hand, the Separation strategy
improves the throughput of model training by separating the training and
inference runtime of the RLHF pipeline with additional shadow models.
Furthermore, our framework provides a simple user interface and allows for the
agile allocation of models across devices in a fine-grained manner for various
training scenarios, involving models of varying sizes and devices of different
scales. Extensive experiments demonstrate that our Interleaving and Separation
strategies achieve notable improvements of up to 11x compared to current SOTA
approaches. The results highlight the effectiveness and
adaptability of our approaches in accelerating the training of distributed
RLHF.
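To make the placement strategies concrete, here is a minimal, illustrative Python sketch. It is not the paper's actual interface: the PlacementPlan class, the device groupings, and the particular model pairings are hypothetical, chosen only to show how Flattening, Interleaving, and Separation differ in where the four RLHF models (and the extra shadow models) are placed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four interdependent RLHF models discussed in the abstract.
RLHF_MODELS = ["actor", "reference", "critic", "reward"]

@dataclass
class PlacementPlan:
    """Maps each model to the GPU ranks it occupies (illustrative only)."""
    strategy: str
    assignment: Dict[str, List[int]] = field(default_factory=dict)

def flattening(world: List[int]) -> PlacementPlan:
    # Baseline: every model is distributed across every device.
    return PlacementPlan("flattening", {m: list(world) for m in RLHF_MODELS})

def interleaving(world: List[int]) -> PlacementPlan:
    # Models without data dependencies get exclusive device groups;
    # the pairing below is one possible choice, not the paper's.
    half = len(world) // 2
    left, right = world[:half], world[half:]
    return PlacementPlan("interleaving", {
        "actor": left, "critic": left,
        "reference": right, "reward": right,
    })

def separation(train_ranks: List[int], infer_ranks: List[int]) -> PlacementPlan:
    # Training models keep the training ranks; inference-only shadow copies
    # handle generation on separate ranks.
    plan = {m: list(train_ranks) for m in RLHF_MODELS}
    plan["actor_shadow"] = list(infer_ranks)
    plan["reference_shadow"] = list(infer_ranks)
    return PlacementPlan("separation", plan)

if __name__ == "__main__":
    gpus = list(range(8))
    for plan in (flattening(gpus), interleaving(gpus), separation(gpus[:6], gpus[6:])):
        print(plan.strategy, plan.assignment)
```

In a real system each device group would presumably also carry its own per-model parallelism configuration (e.g. tensor, pipeline, or ZeRO degrees); that is the fine-grained, per-device allocation the abstract refers to.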
Related papers
- A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
The proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
arXiv Detail & Related papers (2024-04-07T03:04:34Z)
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion (a rough sketch follows this entry).
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
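For intuition on the layer-wise idea in SALF above, here is a rough, hypothetical sketch (not the paper's actual algorithm; the function name and update format are made up): because backpropagation produces gradients from the output layer backwards, a straggler can still report the deeper layers it managed to update, and the server averages each layer over whichever clients provided it.

```python
import numpy as np

def layerwise_aggregate(global_layers, client_updates):
    """global_layers: list of np.ndarray weights, ordered input -> output.
    client_updates: list of dicts {layer_index: updated np.ndarray};
    a straggler simply omits the shallow layers it never reached."""
    aggregated = []
    for i, layer in enumerate(global_layers):
        contribs = [u[i] for u in client_updates if i in u]
        if contribs:
            aggregated.append(np.mean(contribs, axis=0))  # average clients that reached layer i
        else:
            aggregated.append(layer.copy())               # nobody reached it: keep old weights
    return aggregated

# Tiny example: three layers, one full update and one straggler that only reached the last layer.
layers = [np.zeros((2, 2)) for _ in range(3)]
full = {0: np.ones((2, 2)), 1: np.ones((2, 2)), 2: np.ones((2, 2))}
straggler = {2: 3 * np.ones((2, 2))}
print([float(l.mean()) for l in layerwise_aggregate(layers, [full, straggler])])  # [1.0, 1.0, 2.0]
```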
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to 20x when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing the communication and memory costs of large-scale distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Vertical Federated Learning over Cloud-RAN: Convergence Analysis and System Optimization [82.12796238714589]
We propose a novel cloud radio access network (Cloud-RAN) based vertical FL system to enable fast and accurate model aggregation.
We characterize the convergence behavior of the vertical FL algorithm considering both uplink and downlink transmissions.
We establish a system optimization framework by joint transceiver and fronthaul quantization design, for which successive convex approximation and alternate convex search based system optimization algorithms are developed.
arXiv Detail & Related papers (2023-05-04T09:26:03Z)
- Memory-adaptive Depth-wise Heterogenous Federated Learning [24.13198329419849]
We introduce a memory-adaptive depth-wise learning solution in FL called FeDepth, which adaptively decomposes the full model into blocks according to the memory budget of each client (a rough partitioning sketch follows this entry).
Our method outperforms state-of-the-art approaches, achieving 5% and more than 10% improvements in top-1 accuracy on CIFAR-10 and CIFAR-100, respectively.
arXiv Detail & Related papers (2023-03-08T20:52:57Z)
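As a rough illustration of decomposing a model into blocks under a per-client memory budget (hypothetical; FeDepth's actual partitioning and depth-wise training procedure are more involved, and the function below is made up for illustration), a greedy split might look like:

```python
def split_into_blocks(layer_costs, memory_budget):
    """Greedily group consecutive layers into blocks whose summed memory
    cost stays within a client's budget (illustrative only).

    layer_costs: list of per-layer memory costs (e.g. MB).
    memory_budget: maximum cost a single block may have.
    """
    blocks, current, used = [], [], 0
    for i, cost in enumerate(layer_costs):
        if current and used + cost > memory_budget:
            blocks.append(current)          # close the block and start a new one
            current, used = [], 0
        current.append(i)
        used += cost
    if current:
        blocks.append(current)
    return blocks

# A client with a 300 MB budget trains the same model in more, smaller blocks
# than a client with a 700 MB budget.
costs = [120, 180, 150, 200, 90, 160]
print(split_into_blocks(costs, 300))  # [[0, 1], [2], [3, 4], [5]]
print(split_into_blocks(costs, 700))  # [[0, 1, 2, 3], [4, 5]]
```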
- Tensor Decomposition based Personalized Federated Learning [12.420951968273574]
Federated learning (FL) is a new distributed machine learning framework that can achieve reliable collaborative training without collecting users' private data.
Due to FL's frequent communication and averaging-based aggregation strategy, it faces challenges in scaling to statistically diverse data and large-scale models.
We propose a personalized FL framework, named Tensor Decomposition based Personalized Federated learning (TDPFed), in which we design a novel tensorized local model with tensorized linear layers and convolutional layers to reduce the communication cost (a generic factorization sketch follows this entry).
arXiv Detail & Related papers (2022-08-27T08:09:14Z)
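For a rough sense of why factorized layers cut communication cost, here is a generic low-rank matrix factorization standing in for the tensor decomposition TDPFed actually applies; the class name and numbers are illustrative, not the paper's.

```python
import numpy as np

class FactorizedLinear:
    """Approximate a d_in x d_out weight matrix as U @ V with small rank r,
    so a client uploads r*(d_in + d_out) parameters instead of d_in*d_out."""
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(0.0, 0.02, size=(d_in, rank))
        self.V = rng.normal(0.0, 0.02, size=(rank, d_out))

    def forward(self, x):
        return x @ self.U @ self.V   # same output shape as a dense linear layer

    def num_params(self):
        return self.U.size + self.V.size

layer = FactorizedLinear(d_in=1024, d_out=1024, rank=16)
print(layer.num_params(), "parameters vs", 1024 * 1024, "for a dense layer")  # 32768 vs 1048576
```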
- Efficient Split-Mix Federated Learning for On-Demand and In-Situ Customization [107.72786199113183]
Federated learning (FL) provides a distributed learning framework for multiple participants to collaboratively learn without sharing raw data.
In this paper, we propose a novel Split-Mix FL strategy for heterogeneous participants that, once training is done, provides in-situ customization of model sizes and robustness.
arXiv Detail & Related papers (2022-03-18T04:58:34Z)
- FedHe: Heterogeneous Models and Communication-Efficient Federated Learning [0.0]
Federated learning (FL) is able to manage edge devices to cooperatively train a model while keeping the training data local and private.
We propose a novel FL method, called FedHe, inspired by knowledge distillation, which can train heterogeneous models and support asynchronous training processes.
arXiv Detail & Related papers (2021-10-19T12:18:37Z)
- Self-Progressing Robust Training [146.8337017922058]
Current robust training methods such as adversarial training explicitly use an "attack" to generate adversarial examples.
We propose a new framework called SPROUT, self-progressing robust training.
Our results shed new light on scalable, effective and attack-independent robust training methods.
arXiv Detail & Related papers (2020-12-22T00:45:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.