Distributed Training of Deep Learning Models: A Taxonomic Perspective
- URL: http://arxiv.org/abs/2007.03970v1
- Date: Wed, 8 Jul 2020 08:56:58 GMT
- Title: Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Authors: Matthias Langer, Zhen He, Wenny Rahayu, and Yanbo Xue
- Abstract summary: Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster.
We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed deep learning systems (DDLS) train deep neural network models by
utilizing the distributed resources of a cluster. Developers of DDLS are
required to make many decisions to process their particular workloads in their
chosen environment efficiently. The advent of GPU-based deep learning, the
ever-increasing size of datasets and deep neural network models, in combination
with the bandwidth constraints that exist in cluster environments, require
developers of DDLS to be innovative in order to train high-quality models
quickly. Comparing DDLS side-by-side is difficult due to their extensive
feature lists and architectural deviations. We aim to shine some light on the
fundamental principles that are at work when training deep neural networks in a
cluster of independent machines by analyzing the general properties associated
with training deep learning models and how such workloads can be distributed in
a cluster to achieve collaborative model training. Thereby we provide an
overview of the different techniques that are used by contemporary DDLS and
discuss their influence and implications on the training process. To
conceptualize and compare DDLS, we group different techniques into categories,
thus establishing a taxonomy of distributed deep learning systems.
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- A Survey of Distributed Learning in Cloud, Mobile, and Edge Settings [1.0589208420411014]
This survey explores the landscape of distributed learning, encompassing cloud and edge settings.
We delve into the core concepts of data and model parallelism, examining how models are partitioned across different dimensions and layers to optimize resource utilization and performance.
We analyze various partitioning schemes for different layer types, including fully connected, convolutional, and recurrent layers, highlighting the trade-offs between computational efficiency, communication overhead, and memory constraints.
arXiv Detail & Related papers (2024-05-23T22:00:38Z)
- BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion [56.9358325168226]
We propose a Bagging deep learning training algorithm based on Efficient Neural network Diffusion (BEND).
Our approach is simple but effective: it first uses the weights and biases of multiple trained models as inputs to train an autoencoder and a latent diffusion model.
Our proposed BEND algorithm can consistently outperform the mean and median accuracies of both the original trained model and the diffused model.
arXiv Detail & Related papers (2024-03-23T08:40:38Z)
- Diffusion-based Neural Network Weights Generation [85.6725307453325]
We propose an efficient and adaptive transfer learning scheme through dataset-conditioned pretrained weights sampling.
Specifically, we use a latent diffusion model with a variational autoencoder that can reconstruct the neural network weights.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
- Canoe : A System for Collaborative Learning for Neural Nets [4.547883122787855]
Canoe is a framework that facilitates knowledge transfer for neural networks.
Canoe provides new system support for dynamically extracting significant parameters from a helper node's neural network.
The evaluation of Canoe with different PyTorch and neural network models demonstrates that the knowledge transfer mechanism improves the model's adaptiveness to 3.5X compared to learning in isolation.
arXiv Detail & Related papers (2021-08-27T05:30:15Z)
- Model-Based Deep Learning [155.063817656602]
Signal processing, communications, and control have traditionally relied on classical statistical modeling techniques.
Deep neural networks (DNNs) use generic architectures which learn to operate from data, and demonstrate excellent performance.
We are interested in hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches.
arXiv Detail & Related papers (2020-12-15T16:29:49Z)
- Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box features, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, a truncated max-product Belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)