Rethinking Memory and Communication Cost for Efficient Large Language
Model Training
- URL: http://arxiv.org/abs/2310.06003v2
- Date: Mon, 30 Oct 2023 08:07:50 GMT
- Title: Rethinking Memory and Communication Cost for Efficient Large Language
Model Training
- Authors: Chan Wu, Hanxiao Zhang, Lin Ju, Jinjing Huang, Youshao Xiao, Zhaoxin
Huan, Siyuan Li, Fanzhuang Meng, Lei Liang, Xiaolu Zhang and Jun Zhou
- Abstract summary: We rethink the impact of memory consumption and communication costs on the training speed of large language models.
Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method.
The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
- Score: 25.640899145028296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, various distributed strategies for large language model training
have been proposed. However, these methods offer only limited solutions for the
trade-off between memory consumption and communication cost. In this paper, we
rethink the impact of memory consumption and communication cost on the training
speed of large language models, and propose a memory-communication-balanced
strategy set, the Partial Redundancy Optimizer (PaRO). Through fine-grained
sharding strategies, PaRO provides a range of options that reduce the amount and
frequency of inter-group communication at the cost of minor memory redundancy,
thereby improving training efficiency in various training scenarios.
Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring)
communication topology to enhance communication efficiency between nodes or
across switches in large language model training. Our experiments demonstrate
that PaRO significantly improves training throughput by 1.19x-2.50x compared to
the SOTA method and achieves near-linear scalability. The HO-Ring algorithm
improves communication efficiency by 36.5% compared to the traditional Ring
algorithm.
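To make the memory-for-communication trade concrete, below is a small, illustrative Python sketch of group-wise ("partial redundancy") optimizer-state sharding, assuming one sharding group per node. It is our reconstruction of the general idea, not the authors' PaRO implementation; the names (ShardingPlan, group_size, shard_owner) and the numbers in the example are invented for illustration. Each group holds a full replica of the sharded optimizer state, so gathers and scatters stay on fast intra-group links, while per-rank memory grows by the number of groups relative to full sharding.

```python
# Illustrative sketch of grouped ("partial redundancy") optimizer-state sharding.
# Assumed names and numbers; not the PaRO implementation from the paper.
from dataclasses import dataclass


@dataclass
class ShardingPlan:
    world_size: int   # total number of workers
    group_size: int   # workers per sharding group (e.g., GPUs per node)

    @property
    def num_groups(self) -> int:
        assert self.world_size % self.group_size == 0
        return self.world_size // self.group_size

    def shard_owner(self, rank: int, shard_id: int) -> bool:
        # A rank owns shard `shard_id` iff its position inside its group matches.
        # Every group holds a full set of shards -> partial redundancy across groups.
        return rank % self.group_size == shard_id

    def optimizer_bytes_per_rank(self, opt_state_bytes: float) -> float:
        # Fully sharded (ZeRO-3-like): opt_state_bytes / world_size per rank.
        # Grouped sharding: opt_state_bytes / group_size per rank, i.e.,
        # num_groups-fold redundancy in exchange for intra-group-only
        # gather/scatter of the optimizer state.
        return opt_state_bytes / self.group_size


if __name__ == "__main__":
    plan = ShardingPlan(world_size=64, group_size=8)   # 8 nodes x 8 GPUs (assumed)
    opt_bytes = 12e9                                   # ~12 GB of optimizer state (assumed)
    print("groups:", plan.num_groups)
    print("per-rank state, grouped sharding:", plan.optimizer_bytes_per_rank(opt_bytes) / 1e9, "GB")
    print("per-rank state, full sharding   :", opt_bytes / plan.world_size / 1e9, "GB")
    # Shard 3 is replicated once per group and never gathered across groups.
    print("owners of shard 3:", [r for r in range(plan.world_size) if plan.shard_owner(r, 3)])
```

With these assumed numbers, grouped sharding keeps 1.5 GB of optimizer state per rank versus 0.1875 GB under full sharding, but all state collectives stay inside a node, which is the kind of memory-for-communication trade the abstract describes.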
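The HO-Ring topology is only described at a high level in the abstract. As rough intuition, the following sketch simulates a generic hierarchical ring all-reduce (intra-node reduce, ring all-reduce across one leader per node, intra-node broadcast) in plain Python. The structure and names are assumptions rather than the paper's algorithm, and the overlapping of phases that HO-Ring adds is omitted.

```python
# Minimal simulation of a hierarchical ring all-reduce: (1) reduce inside each
# node, (2) ring all-reduce across one "leader" per node, (3) broadcast back.
# This is a generic textbook pattern, not the HO-Ring algorithm itself.
from typing import List


def ring_allreduce(vectors: List[List[float]]) -> List[List[float]]:
    """Simulated ring all-reduce (reduce-scatter + all-gather) over equal-length
    vectors. A per-step snapshot mimics the simultaneous sends of the real ring."""
    n, length = len(vectors), len(vectors[0])
    assert length % n == 0, "vector length must split evenly into n segments"
    seg = length // n

    # Reduce-scatter: after n-1 steps, rank r holds the fully reduced segment (r+1) % n.
    for step in range(n - 1):
        snap = [v[:] for v in vectors]
        for r in range(n):
            s, dst = (r - step) % n, (r + 1) % n
            for i in range(s * seg, (s + 1) * seg):
                vectors[dst][i] += snap[r][i]

    # All-gather: circulate the fully reduced segments around the ring.
    for step in range(n - 1):
        snap = [v[:] for v in vectors]
        for r in range(n):
            s, dst = (r + 1 - step) % n, (r + 1) % n
            vectors[dst][s * seg:(s + 1) * seg] = snap[r][s * seg:(s + 1) * seg]
    return vectors


def hierarchical_allreduce(grads: List[List[List[float]]]) -> List[List[List[float]]]:
    """grads[node][local_rank] -> gradient vector for that worker."""
    # 1) Intra-node reduce onto each node's leader.
    leaders = [[sum(vals) for vals in zip(*node)] for node in grads]
    # 2) Inter-node ring all-reduce over leaders only (the slow links).
    leaders = ring_allreduce(leaders)
    # 3) Intra-node broadcast of the final result.
    return [[leader[:] for _ in node] for node, leader in zip(grads, leaders)]


if __name__ == "__main__":
    # 2 nodes x 2 workers, 8-element gradients: node 0 holds 1.0s, node 1 holds 2.0s.
    grads = [[[1.0] * 8 for _ in range(2)], [[2.0] * 8 for _ in range(2)]]
    out = hierarchical_allreduce(grads)
    assert all(vec == [6.0] * 8 for node in out for vec in node)  # 2*1 + 2*2 = 6
    print("hierarchical all-reduce OK:", out[0][0][:4])
```

In this pattern only one worker per node joins the inter-node ring, so traffic crossing nodes or switches shrinks by roughly the intra-node group size, which is the general motivation for hierarchical topologies such as HO-Ring.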
Related papers
- FedsLLM: Federated Split Learning for Large Language Models over Communication Networks [30.47242577997792]
This paper combines low-rank adaptation (LoRA) with the splitfed learning framework to propose federated split learning for large language models (FedsLLM).
The proposed algorithm reduces delays by an average of 47.63% compared to unoptimized scenarios.
arXiv Detail & Related papers (2024-07-12T13:23:54Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for communication-efficient distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Personalizing Federated Learning with Over-the-Air Computations [84.8089761800994]
Federated edge learning is a promising technology to deploy intelligence at the edge of wireless networks in a privacy-preserving manner.
Under such a setting, multiple clients collaboratively train a global generic model under the coordination of an edge server.
This paper presents a distributed training paradigm that employs analog over-the-air computation to address the communication bottleneck.
arXiv Detail & Related papers (2023-02-24T08:41:19Z)
- Federated Reinforcement Learning at the Edge [1.4271989597349055]
Modern cyber-physical architectures use data collected from systems at different physical locations to learn appropriate behaviors and adapt to uncertain environments.
This paper considers a setup where multiple agents need to communicate efficiently in order to jointly solve a reinforcement learning problem over time-series data collected in a distributed manner.
An algorithm for achieving communication efficiency is proposed, supported with theoretical guarantees, practical implementations, and numerical evaluations.
arXiv Detail & Related papers (2021-12-11T03:28:59Z)
- Federated Learning over Wireless IoT Networks with Optimized Communication and Resources [98.18365881575805]
Federated learning (FL), as a paradigm of collaborative learning, has attracted increasing research attention.
It is of interest to investigate fast-responding and accurate FL schemes over wireless systems.
We show that the proposed communication-efficient federated learning framework converges at a strong linear rate.
arXiv Detail & Related papers (2021-10-22T13:25:57Z)
- Toward Communication Efficient Adaptive Gradient Method [29.02154169980269]
In recent years, distributed optimization has proven to be an effective approach to accelerate the training of large-scale machine learning models such as deep neural networks.
In the hope of training machine learning models on mobile devices, a new distributed training paradigm called "federated learning" has become popular.
We propose an adaptive gradient method that can guarantee both the convergence and the communication efficiency for federated learning.
arXiv Detail & Related papers (2021-09-10T21:14:36Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
- CosSGD: Nonlinear Quantization for Communication-efficient Federated Learning [62.65937719264881]
Federated learning facilitates learning across clients without transferring local data on these clients to a central server.
We propose a nonlinear quantization for compressed gradient descent, which can be easily utilized in federated learning.
Our system significantly reduces the communication cost by up to three orders of magnitude, while maintaining convergence and accuracy of the training process.
arXiv Detail & Related papers (2020-12-15T12:20:28Z)
- Distributed Sparse SGD with Majority Voting [5.32836690371986]
We introduce a majority voting based sparse communication strategy for distributed learning.
We show that it is possible to achieve up to 4000x compression without any loss in test accuracy.
arXiv Detail & Related papers (2020-11-12T17:06:36Z)
- Communication-Efficient and Distributed Learning Over Wireless Networks: Principles and Applications [55.65768284748698]
Machine learning (ML) is a promising enabler for the fifth generation (5G) communication systems and beyond.
This article aims to provide a holistic overview of relevant communication and ML principles, and thereby present communication-efficient and distributed learning frameworks with selected use cases.
arXiv Detail & Related papers (2020-08-06T12:37:14Z)