Nebula-I: A General Framework for Collaboratively Training Deep Learning
Models on Low-Bandwidth Cloud Clusters
- URL: http://arxiv.org/abs/2205.09470v1
- Date: Thu, 19 May 2022 11:10:14 GMT
- Title: Nebula-I: A General Framework for Collaboratively Training Deep Learning
Models on Low-Bandwidth Cloud Clusters
- Authors: Yang Xiang, Zhihua Wu, Weibao Gong, Siyu Ding, Xianjie Mo, Yuang Liu,
Shuohuan Wang, Peng Liu, Yongshuai Hou, Long Li, Bin Wang, Shaohuai Shi,
Yaqian Han, Yue Yu, Ge Li, Yu Sun, Yanjun Ma, Dianhai Yu
- Abstract summary: We introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters.
Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware.
Experiments demonstrate that the proposed framework substantially improves training efficiency while preserving satisfactory NLP performance.
- Score: 39.85470606966918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ever-growing model size and scale of compute have attracted
increasing interest in training deep learning models over multiple nodes.
However, training on cloud clusters, especially across remote clusters, poses
huge challenges. In this work, we introduce a general framework, Nebula-I, for
collaboratively training deep learning models over remote heterogeneous
clusters connected by low-bandwidth wide area networks (WANs). We take natural
language processing (NLP) as an example to show how Nebula-I works in
different training phases, including: a) pre-training a multilingual language
model using two remote clusters; and b) fine-tuning a machine translation
model using knowledge distilled from pre-trained models, which together follow
the pre-train-then-fine-tune paradigm that dominates recent deep learning. To
balance accuracy and communication efficiency, Nebula-I jointly applies
parameter-efficient training strategies, hybrid parallel computing methods,
and adaptive communication acceleration techniques. Meanwhile, security
strategies are employed to guarantee the safety, reliability, and privacy of
intra-cluster computation and inter-cluster communication. Nebula-I is
implemented with the PaddlePaddle deep learning framework, which supports
collaborative training over heterogeneous hardware, e.g., GPUs and NPUs.
Experiments demonstrate that the proposed framework substantially improves
training efficiency while preserving satisfactory NLP performance. With
Nebula-I, users can run large-scale training tasks over cloud clusters with
minimal development effort, and the utility of existing large pre-trained
models can be further exploited. We also report new state-of-the-art results
on cross-lingual natural language inference tasks, obtained with a novel
learning framework and Nebula-I.
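To make phase b) above concrete, the following is a minimal sketch of a knowledge-distillation fine-tuning step of the kind the abstract describes, where a student translation model learns from a frozen pre-trained teacher. It is written in PyTorch for brevity (Nebula-I itself is built on PaddlePaddle), and the model, optimizer, temperature T, and mixing weight alpha are illustrative assumptions, not the paper's actual recipe.

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, batch, optimizer, T=2.0, alpha=0.5):
        """One fine-tuning step: hard-label loss plus soft-label (teacher) loss."""
        inputs, labels = batch                          # token ids and gold targets
        with torch.no_grad():                           # the pre-trained teacher is frozen
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)

        # Standard cross-entropy against the gold labels.
        ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                             labels.view(-1))
        # KL divergence to the temperature-softened teacher distribution.
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)

        loss = alpha * ce + (1.0 - alpha) * kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()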
Related papers
- Research on Key Technologies for Cross-Cloud Federated Training of Large Language Models [7.762524368844918]
Cross-cloud federated training offers a new approach to addressing the resource bottlenecks of a single cloud platform.
This study analyzes the key technologies of cross-cloud federated training, including data partitioning and distribution, communication optimization, model aggregation algorithms, and the compatibility of heterogeneous cloud platforms.
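A minimal sketch of one ingredient listed above, model aggregation: FedAvg-style weighted averaging of the parameters contributed by each cloud cluster. The aggregation rule and the dict-of-tensors representation are illustrative assumptions; the surveyed systems may use different algorithms and transport layers.

    import torch

    def aggregate(cloud_states, num_samples):
        """Weighted average of per-cloud model state_dicts.

        cloud_states -- list of {param_name: tensor} dicts, one per cloud cluster
        num_samples  -- list of local dataset sizes used as aggregation weights
        """
        total = float(sum(num_samples))
        weights = [n / total for n in num_samples]
        merged = {}
        for name in cloud_states[0]:
            merged[name] = sum(w * state[name].float()
                               for w, state in zip(weights, cloud_states))
        return merged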
arXiv Detail & Related papers (2024-10-24T19:57:17Z)
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM accommodates a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
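As a rough illustration of the model-swapping idea only (not ATOM's actual scheduler or its asynchronous multi-peer training), the sketch below streams one layer at a time onto the accelerator so that a model larger than device memory can still run a forward pass on a single host; the layer sizes are arbitrary.

    import torch
    import torch.nn as nn

    def swapped_forward(layers, x, device):
        """Forward pass that holds only one layer on the device at a time."""
        x = x.to(device)
        for layer in layers:                # layers live in host memory between uses
            layer.to(device)                # swap the current layer in
            x = layer(x)
            layer.to("cpu")                 # swap it back out to free device memory
        return x

    blocks = nn.ModuleList([nn.Linear(512, 512) for _ in range(8)])
    device = "cuda" if torch.cuda.is_available() else "cpu"
    out = swapped_forward(blocks, torch.randn(4, 512), device)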
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- FedBone: Towards Large-Scale Federated Multi-Task Learning [13.835972363413884]
In real-world applications, visual and natural language tasks typically require large-scale models to extract high-level abstract features.
Existing HFML methods disregard the impact of gradient conflicts on multi-task optimization.
We propose an innovative framework called FedBone, which enables the construction of large-scale models with better generalization.
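To illustrate what a gradient conflict is, here is a generic PCGrad-style projection (not FedBone's own algorithm): when two task gradients point in opposing directions, one is projected onto the normal plane of the other before the update.

    import torch

    def project_conflicting(g_task_a, g_task_b):
        """If the flattened task gradients conflict (negative dot product),
        remove from g_task_a its component along g_task_b."""
        dot = torch.dot(g_task_a, g_task_b)
        if dot < 0:                                        # the tasks disagree
            g_task_a = g_task_a - dot / (g_task_b.norm() ** 2 + 1e-12) * g_task_b
        return g_task_a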
arXiv Detail & Related papers (2023-06-30T08:19:38Z)
- Personalizing Federated Learning with Over-the-Air Computations [84.8089761800994]
Federated edge learning is a promising technology to deploy intelligence at the edge of wireless networks in a privacy-preserving manner.
Under such a setting, multiple clients collaboratively train a global generic model under the coordination of an edge server.
This paper presents a distributed training paradigm that employs analog over-the-air computation to address the communication bottleneck.
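A toy simulation of the analog over-the-air aggregation described above: every client transmits its update over the same channel, so the edge server receives the superposition of all updates plus channel noise. The noise model and rescaling are illustrative assumptions.

    import numpy as np

    def over_the_air_aggregate(client_updates, noise_std=0.01):
        """client_updates: list of equally shaped np.ndarray model deltas."""
        superposed = np.sum(client_updates, axis=0)              # the channel adds the signals
        noise = np.random.normal(0.0, noise_std, superposed.shape)
        return (superposed + noise) / len(client_updates)        # server rescales to an average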
arXiv Detail & Related papers (2023-02-24T08:41:19Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Local Learning with Neuron Groups [15.578925277062657]
Local learning is an approach to model-parallelism that removes the standard end-to-end learning setup.
We study how local learning can be applied at the level of splitting layers or modules into sub-components.
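A compact sketch of local learning in the spirit described above: the network is split into sub-modules, each trained against its own auxiliary head and loss, and detach() prevents any gradient from crossing module boundaries. The layer sizes and auxiliary classifiers are illustrative choices, not the paper's configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    blocks = nn.ModuleList([nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
                            nn.Sequential(nn.Linear(256, 128), nn.ReLU())])
    heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(128, 10)])
    opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
            for b, h in zip(blocks, heads)]

    def local_step(x, y):
        """Train each block with a purely local objective; no end-to-end backprop."""
        for block, head, opt in zip(blocks, heads, opts):
            h = block(x)
            loss = F.cross_entropy(head(h), y)      # local loss for this block only
            opt.zero_grad()
            loss.backward()                         # gradients stop at this block
            opt.step()
            x = h.detach()                          # cut the graph between modules
        return loss.item()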
arXiv Detail & Related papers (2023-01-18T16:25:10Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
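For context on the spiking neurons mentioned above, here is a bare leaky integrate-and-fire (LIF) update of the kind packaged by snnTorch; the decay constant, threshold, and soft-reset rule below are illustrative choices, not snnTorch defaults.

    import torch

    def lif_step(input_current, mem, beta=0.9, threshold=1.0):
        """One discrete-time LIF update: leak, integrate, fire, reset."""
        mem = beta * mem + input_current          # leaky integration of the input
        spk = (mem >= threshold).float()          # emit a spike where the threshold is crossed
        mem = mem - spk * threshold               # soft reset of spiking neurons
        return spk, mem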
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
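As a simplified illustration of the "adjacency region" idea above (not CsaNMT's actual sampling procedure), augmented representations can be drawn by interpolating between the embeddings of two semantically equivalent sentences:

    import torch

    def augment_in_region(anchor_emb, variant_emb, num_samples=4):
        """Sample points on the segment between two semantically equivalent
        sentence embeddings of shape (seq_len, dim)."""
        lam = torch.rand(num_samples, 1, 1)                    # interpolation weights in [0, 1)
        return lam * anchor_emb.unsqueeze(0) + (1 - lam) * variant_emb.unsqueeze(0)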
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Distributed Training of Deep Learning Models: A Taxonomic Perspective [11.924058430461216]
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster.
We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines.
arXiv Detail & Related papers (2020-07-08T08:56:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.