Rethinking Memory and Communication Cost for Efficient Large Language
Model Training
- URL: http://arxiv.org/abs/2310.06003v2
- Date: Mon, 30 Oct 2023 08:07:50 GMT
- Title: Rethinking Memory and Communication Cost for Efficient Large Language
Model Training
- Authors: Chan Wu, Hanxiao Zhang, Lin Ju, Jinjing Huang, Youshao Xiao, Zhaoxin
Huan, Siyuan Li, Fanzhuang Meng, Lei Liang, Xiaolu Zhang and Jun Zhou
- Abstract summary: We rethink the impact of memory consumption and communication costs on the training speed of large language models.
Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method.
The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
- Score: 25.640899145028296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, various distributed strategies for large language model training
have been proposed. However, these methods offer only limited solutions for the
trade-off between memory consumption and communication cost. In this paper, we
rethink the impact of memory consumption and communication cost on the training
speed of large language models, and propose a memory-communication-balanced
strategy set, the Partial Redundancy Optimizer (PaRO). Through fine-grained
sharding strategies, PaRO provides a range of options that reduce the amount and
frequency of inter-group communication at the cost of minor memory redundancy,
thereby improving training efficiency in various training scenarios.
Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring)
communication topology to enhance communication efficiency between nodes or
across switches in large language model training. Our experiments demonstrate
that PaRO significantly improves training throughput by 1.19x-2.50x compared to
the SOTA method and achieves near-linear scalability. The HO-Ring algorithm
improves communication efficiency by 36.5% compared to the traditional Ring
algorithm.
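To make the memory-for-communication trade concrete, below is a small, illustrative Python sketch of group-wise ("partial redundancy") optimizer-state sharding, assuming one sharding group per node. It is our reconstruction of the general idea, not the authors' PaRO implementation; the names (ShardingPlan, group_size, shard_owner) and the numbers in the example are invented for illustration. Each group holds a full replica of the sharded optimizer state, so gathers and scatters stay on fast intra-group links, while per-rank memory grows by the number of groups relative to full sharding.

```python
# Illustrative sketch of grouped ("partial redundancy") optimizer-state sharding.
# Assumed names and numbers; not the PaRO implementation from the paper.
from dataclasses import dataclass


@dataclass
class ShardingPlan:
    world_size: int   # total number of workers
    group_size: int   # workers per sharding group (e.g., GPUs per node)

    @property
    def num_groups(self) -> int:
        assert self.world_size % self.group_size == 0
        return self.world_size // self.group_size

    def shard_owner(self, rank: int, shard_id: int) -> bool:
        # A rank owns shard `shard_id` iff its position inside its group matches.
        # Every group holds a full set of shards -> partial redundancy across groups.
        return rank % self.group_size == shard_id

    def optimizer_bytes_per_rank(self, opt_state_bytes: float) -> float:
        # Fully sharded (ZeRO-3-like): opt_state_bytes / world_size per rank.
        # Grouped sharding: opt_state_bytes / group_size per rank, i.e.,
        # num_groups-fold redundancy in exchange for intra-group-only
        # gather/scatter of the optimizer state.
        return opt_state_bytes / self.group_size


if __name__ == "__main__":
    plan = ShardingPlan(world_size=64, group_size=8)   # 8 nodes x 8 GPUs (assumed)
    opt_bytes = 12e9                                   # ~12 GB of optimizer state (assumed)
    print("groups:", plan.num_groups)
    print("per-rank state, grouped sharding:", plan.optimizer_bytes_per_rank(opt_bytes) / 1e9, "GB")
    print("per-rank state, full sharding   :", opt_bytes / plan.world_size / 1e9, "GB")
    # Shard 3 is replicated once per group and never gathered across groups.
    print("owners of shard 3:", [r for r in range(plan.world_size) if plan.shard_owner(r, 3)])
```

With these assumed numbers, grouped sharding keeps 1.5 GB of optimizer state per rank versus 0.1875 GB under full sharding, but all state collectives stay inside a node, which is the kind of memory-for-communication trade the abstract describes.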
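The HO-Ring topology is only described at a high level in the abstract. As rough intuition, the following sketch simulates a generic hierarchical ring all-reduce (intra-node reduce, ring all-reduce across one leader per node, intra-node broadcast) in plain Python. The structure and names are assumptions rather than the paper's algorithm, and the overlapping of phases that HO-Ring adds is omitted.

```python
# Minimal simulation of a hierarchical ring all-reduce: (1) reduce inside each
# node, (2) ring all-reduce across one "leader" per node, (3) broadcast back.
# This is a generic textbook pattern, not the HO-Ring algorithm itself.
from typing import List


def ring_allreduce(vectors: List[List[float]]) -> List[List[float]]:
    """Simulated ring all-reduce (reduce-scatter + all-gather) over equal-length
    vectors. A per-step snapshot mimics the simultaneous sends of the real ring."""
    n, length = len(vectors), len(vectors[0])
    assert length % n == 0, "vector length must split evenly into n segments"
    seg = length // n

    # Reduce-scatter: after n-1 steps, rank r holds the fully reduced segment (r+1) % n.
    for step in range(n - 1):
        snap = [v[:] for v in vectors]
        for r in range(n):
            s, dst = (r - step) % n, (r + 1) % n
            for i in range(s * seg, (s + 1) * seg):
                vectors[dst][i] += snap[r][i]

    # All-gather: circulate the fully reduced segments around the ring.
    for step in range(n - 1):
        snap = [v[:] for v in vectors]
        for r in range(n):
            s, dst = (r + 1 - step) % n, (r + 1) % n
            vectors[dst][s * seg:(s + 1) * seg] = snap[r][s * seg:(s + 1) * seg]
    return vectors


def hierarchical_allreduce(grads: List[List[List[float]]]) -> List[List[List[float]]]:
    """grads[node][local_rank] -> gradient vector for that worker."""
    # 1) Intra-node reduce onto each node's leader.
    leaders = [[sum(vals) for vals in zip(*node)] for node in grads]
    # 2) Inter-node ring all-reduce over leaders only (the slow links).
    leaders = ring_allreduce(leaders)
    # 3) Intra-node broadcast of the final result.
    return [[leader[:] for _ in node] for node, leader in zip(grads, leaders)]


if __name__ == "__main__":
    # 2 nodes x 2 workers, 8-element gradients: node 0 holds 1.0s, node 1 holds 2.0s.
    grads = [[[1.0] * 8 for _ in range(2)], [[2.0] * 8 for _ in range(2)]]
    out = hierarchical_allreduce(grads)
    assert all(vec == [6.0] * 8 for node in out for vec in node)  # 2*1 + 2*2 = 6
    print("hierarchical all-reduce OK:", out[0][0][:4])
```

In this pattern only one worker per node joins the inter-node ring, so traffic crossing nodes or switches shrinks by roughly the intra-node group size, which is the general motivation for hierarchical topologies such as HO-Ring.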
Related papers
- FedsLLM: Federated Split Learning for Large Language Models over Communication Networks [30.47242577997792]
This paper combines low-rank adaptation (LoRA) with the splitfed learning framework to propose federated split learning for large language models (FedsLLM).
The proposed algorithm reduces delays by an average of 47.63% compared to unoptimized scenarios.
arXiv Detail & Related papers (2024-07-12T13:23:54Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for communication-efficient distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Personalizing Federated Learning with Over-the-Air Computations [84.8089761800994]
Federated edge learning is a promising technology to deploy intelligence at the edge of wireless networks in a privacy-preserving manner.
Under such a setting, multiple clients collaboratively train a global generic model under the coordination of an edge server.
This paper presents a distributed training paradigm that employs analog over-the-air computation to address the communication bottleneck.
arXiv Detail & Related papers (2023-02-24T08:41:19Z)
- Federated Reinforcement Learning at the Edge [1.4271989597349055]
Modern cyber-physical architectures use data collected from systems at different physical locations to learn appropriate behaviors and adapt to uncertain environments.
This paper considers a setup where multiple agents need to communicate efficiently in order to jointly solve a reinforcement learning problem over time-series data collected in a distributed manner.
An algorithm for achieving communication efficiency is proposed, supported with theoretical guarantees, practical implementations, and numerical evaluations.
arXiv Detail & Related papers (2021-12-11T03:28:59Z)
- Federated Learning over Wireless IoT Networks with Optimized Communication and Resources [98.18365881575805]
Federated learning (FL), as a paradigm of collaborative learning, has attracted increasing research attention.
It is of interest to investigate fast-responding and accurate FL schemes over wireless systems.
We show that the proposed communication-efficient federated learning framework converges at a strong linear rate.
arXiv Detail & Related papers (2021-10-22T13:25:57Z)
- Toward Communication Efficient Adaptive Gradient Method [29.02154169980269]
In recent years, distributed optimization has proven to be an effective approach to accelerate the training of large-scale machine learning models such as deep neural networks.
In the hope of training machine learning models on mobile devices, a new distributed training paradigm called "federated learning" has become popular.
We propose an adaptive gradient method that can guarantee both the convergence and the communication efficiency for federated learning.
arXiv Detail & Related papers (2021-09-10T21:14:36Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
- CosSGD: Nonlinear Quantization for Communication-efficient Federated Learning [62.65937719264881]
Federated learning facilitates learning across clients without transferring local data on these clients to a central server.
We propose a nonlinear quantization for compressed gradient descent, which can be easily utilized in federated learning.
Our system significantly reduces the communication cost by up to three orders of magnitude, while maintaining convergence and accuracy of the training process.
arXiv Detail & Related papers (2020-12-15T12:20:28Z)
- Distributed Sparse SGD with Majority Voting [5.32836690371986]
We introduce a majority voting based sparse communication strategy for distributed learning.
We show that it is possible to achieve up to 4000x compression without any loss in test accuracy.
arXiv Detail & Related papers (2020-11-12T17:06:36Z)
- Communication-Efficient and Distributed Learning Over Wireless Networks: Principles and Applications [55.65768284748698]
Machine learning (ML) is a promising enabler for the fifth generation (5G) communication systems and beyond.
This article aims to provide a holistic overview of relevant communication and ML principles, and thereby present communication-efficient and distributed learning frameworks with selected use cases.
arXiv Detail & Related papers (2020-08-06T12:37:14Z)