Nebula-I: A General Framework for Collaboratively Training Deep Learning
Models on Low-Bandwidth Cloud Clusters
- URL: http://arxiv.org/abs/2205.09470v1
- Date: Thu, 19 May 2022 11:10:14 GMT
- Title: Nebula-I: A General Framework for Collaboratively Training Deep Learning
Models on Low-Bandwidth Cloud Clusters
- Authors: Yang Xiang, Zhihua Wu, Weibao Gong, Siyu Ding, Xianjie Mo, Yuang Liu,
Shuohuan Wang, Peng Liu, Yongshuai Hou, Long Li, Bin Wang, Shaohuai Shi,
Yaqian Han, Yue Yu, Ge Li, Yu Sun, Yanjun Ma, Dianhai Yu
- Abstract summary: We introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters.
Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware.
Experiments demonstrate that the proposed framework can substantially improve training efficiency while preserving satisfactory NLP performance.
- Score: 39.85470606966918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ever-growing model size and scale of compute have attracted increasing
interest in training deep learning models over multiple nodes. However,
training on cloud clusters, especially across remote clusters, poses major
challenges. In this work, we introduce a general framework, Nebula-I,
for collaboratively training deep learning models over remote heterogeneous
clusters, the connections between which are low-bandwidth wide area networks
(WANs). We take natural language processing (NLP) as an example to show how
Nebula-I works in different training phases: a) pre-training a
multilingual language model using two remote clusters; and b) fine-tuning a
machine translation model using knowledge distilled from pre-trained models.
Together, these phases span the most popular paradigm of recent deep learning. To balance
the accuracy and communication efficiency, in Nebula-I, parameter-efficient
training strategies, hybrid parallel computing methods and adaptive
communication acceleration techniques are jointly applied. Meanwhile, security
strategies are employed to guarantee the safety, reliability and privacy in
intra-cluster computation and inter-cluster communication. Nebula-I is
implemented with the PaddlePaddle deep learning framework, which can support
collaborative training over heterogeneous hardware, e.g. GPU and NPU.
Experiments demonstrate that the proposed framework can substantially
improve training efficiency while preserving satisfactory NLP performance.
By using Nebula-I, users can run large-scale training tasks over cloud clusters
with minimal development effort, and the utility of existing large pre-trained models
can be further extended. We also report new state-of-the-art results on
cross-lingual natural language inference tasks, which are obtained with
a novel learning framework and Nebula-I.
Related papers
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of very large models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- FedBone: Towards Large-Scale Federated Multi-Task Learning [13.835972363413884]
In real-world applications, visual and natural language tasks typically require large-scale models to extract high-level abstract features.
Existing HFML methods disregard the impact of gradient conflicts on multi-task optimization.
We propose an innovative framework called FedBone, which enables the construction of large-scale models with better generalization.
arXiv Detail & Related papers (2023-06-30T08:19:38Z)
- Personalizing Federated Learning with Over-the-Air Computations [84.8089761800994]
Federated edge learning is a promising technology to deploy intelligence at the edge of wireless networks in a privacy-preserving manner.
Under such a setting, multiple clients collaboratively train a global generic model under the coordination of an edge server.
This paper presents a distributed training paradigm that employs analog over-the-air computation to address the communication bottleneck.
arXiv Detail & Related papers (2023-02-24T08:41:19Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Local Learning with Neuron Groups [15.578925277062657]
Local learning is an approach to model-parallelism that removes the standard end-to-end learning setup.
We study how local learning can be applied at the level of splitting layers or modules into sub-components.
arXiv Detail & Related papers (2023-01-18T16:25:10Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Supernet Training for Federated Image Classification under System
Heterogeneity [15.2292571922932]
In this work, we propose a novel framework to consider both scenarios, namely Federation of Supernet Training (FedSup).
It is inspired by how averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training.
Under our framework, we present an efficient algorithm (E-FedSup) by sending the sub-model to clients in the broadcast stage for reducing communication costs and training overhead.
arXiv Detail & Related papers (2022-06-03T02:21:01Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal
Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources.
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
- Distributed Training of Deep Learning Models: A Taxonomic Perspective [11.924058430461216]
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster.
We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines.
arXiv Detail & Related papers (2020-07-08T08:56:58Z)