Crossover-SGD: A gossip-based communication in distributed deep learning
for alleviating large mini-batch problem and enhancing scalability
- URL: http://arxiv.org/abs/2012.15198v1
- Date: Wed, 30 Dec 2020 15:39:13 GMT
- Title: Crossover-SGD: A gossip-based communication in distributed deep learning
for alleviating large mini-batch problem and enhancing scalability
- Authors: Sangho Yeo, Minho Bae, Minjoong Jeong, Oh-kyoung Kwon, Sangyoon Oh
- Abstract summary: We study the characteristics of gossip methods in a large mini-batch problem.
We propose Crossover-SGD, which alleviates the delayed propagation of weight parameters via segment-wise communication.
We also adapt hierarchical communication to limit the number of workers in gossip-based communication methods.
- Score: 0.5249805590164902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed deep learning is an effective way to reduce the training time of
deep learning for large datasets as well as complex models. However, the
limited scalability caused by network overheads makes it difficult to
synchronize the parameters of all workers. To resolve this problem,
gossip-based methods that demonstrate stable scalability regardless of the
number of workers have been proposed. However, to use gossip-based methods in
general cases, the validation accuracy for a large mini-batch needs to be
verified. To verify this, we first empirically study the characteristics of
gossip methods in a large mini-batch problem and observe that the gossip
methods preserve higher validation accuracy than AllReduce-SGD (Stochastic
Gradient Descent) when the batch size is increased and the number of
workers is fixed. However, the delayed parameter propagation of the
gossip-based models decreases validation accuracy at large node scales. To cope
with this problem, we propose Crossover-SGD, which alleviates the delayed
propagation of weight parameters via segment-wise communication and a
load-balancing random network topology. We also adapt hierarchical communication to
limit the number of workers in gossip-based communication methods. To validate
the effectiveness of our proposed method, we conduct empirical experiments and
observe that our Crossover-SGD shows higher node scalability than
SGP (Stochastic Gradient Push).
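The mechanism sketched in the abstract (gossip exchanges performed segment by segment over a freshly drawn, load-balanced pairing of workers, plus a hierarchical step that caps how many workers gossip directly) can be illustrated with a small simulation. Below is a minimal NumPy sketch under stated assumptions: the worker count, the split into four contiguous segments, the random perfect matching standing in for the load-balancing random topology, and the leader-per-group hierarchical step are illustrative choices for this example, not the authors' implementation.

```python
# Minimal NumPy simulation of segment-wise gossip averaging in the spirit of
# Crossover-SGD. Illustrative sketch only: the segmentation, the random matching
# used as a stand-in for the "load-balancing random topology", and the group
# sizes are assumptions made for the example, not the paper's exact algorithm.
import numpy as np

rng = np.random.default_rng(0)

NUM_WORKERS = 8          # total workers
NUM_SEGMENTS = 4         # the weight vector is exchanged segment by segment
GROUP_SIZE = 4           # hierarchical variant: gossip only between group leaders
DIM = 16                 # size of the flattened parameter vector

# Each worker starts from a different parameter vector (as after local SGD steps).
params = rng.normal(size=(NUM_WORKERS, DIM))


def random_matching(workers):
    """Pair up workers uniformly at random, so each worker talks to exactly one peer."""
    order = rng.permutation(workers)
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


def segmentwise_gossip_step(params, workers):
    """One gossip round: every segment is averaged over an independent random pairing,
    so a single round mixes different parts of the model along different edges."""
    segments = np.array_split(np.arange(params.shape[1]), NUM_SEGMENTS)
    for seg in segments:
        for a, b in random_matching(workers):
            avg = 0.5 * (params[a, seg] + params[b, seg])
            params[a, seg] = avg
            params[b, seg] = avg


def hierarchical_step(params):
    """Hierarchical variant: exact averaging inside small groups (AllReduce-like),
    then segment-wise gossip among one leader per group, limiting gossip fan-out."""
    groups = [np.arange(g, g + GROUP_SIZE) for g in range(0, NUM_WORKERS, GROUP_SIZE)]
    for g in groups:
        params[g] = params[g].mean(axis=0)          # intra-group exact average
    leaders = np.array([g[0] for g in groups])
    segmentwise_gossip_step(params, leaders)        # inter-group gossip between leaders
    for g in groups:
        params[g] = params[g[0]]                    # broadcast leader result inside group


for step in range(10):
    hierarchical_step(params)
    spread = np.abs(params - params.mean(axis=0)).max()
    print(f"step {step}: max deviation from global mean = {spread:.4f}")
```

Because every pairwise exchange replaces both copies of a segment with their average, the global parameter mean is preserved while each message carries only one segment between two peers; this is the intuition behind the reduced propagation delay and the stable scalability claimed above.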
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Asynchronous Federated Stochastic Optimization for Heterogeneous Objectives Under Arbitrary Delays [0.0]
Federated learning (FL) was recently proposed to securely train models with data held over multiple locations ("clients").
Two major challenges hindering the performance of FL algorithms are long training times caused by straggling clients, and a decline in model accuracy under non-iid local data distributions ("client drift").
We propose and analyze Asynchronous Exact Averaging (AREA), a new (sub)gradient algorithm that utilizes communication to speed up convergence and enhance scalability, and employs client memory to correct the client drift caused by variations in client update frequencies.
arXiv Detail & Related papers (2024-05-16T14:22:49Z)
- Few-Shot Class Incremental Learning via Robust Transformer Approach [16.590193619691416]
Few-Shot Class-Incremental Learning presents an extension of the Class Incremental Learning problem where a model is faced with the problem of data scarcity.
This remains an open problem because all recent works are built upon convolutional neural networks, which perform sub-optimally.
Our paper presents Robust Transformer Approach built upon the Compact Convolution Transformer.
arXiv Detail & Related papers (2024-05-08T03:35:52Z)
- Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
arXiv Detail & Related papers (2023-02-19T17:42:35Z)
- Quantized Distributed Training of Large Models with Convergence Guarantees [34.054462975511996]
We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees.
We show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
arXiv Detail & Related papers (2023-02-05T14:20:55Z)
- Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
arXiv Detail & Related papers (2021-11-01T04:54:55Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance (a minimal sketch of this setup appears after this list).
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEF-A to accelerate the generalization of DEF, which shows better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
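For contrast with the gossip-based scheme above, the "Sparse Communication for Training Deep Networks" entry describes the synchronous baseline in which every worker applies the average of all workers' gradients, typically with compression to reduce communication. The following is a minimal NumPy sketch of that setup; the top-k compressor, the value of K, and the error-feedback residual are illustrative assumptions, not the specific schemes studied in that paper.

```python
# Minimal NumPy sketch of synchronous gradient averaging with top-k
# sparsification, the kind of compressed communication surveyed in
# "Sparse Communication for Training Deep Networks". The compressor, the
# value of K, and the error-feedback residual are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

NUM_WORKERS = 4
DIM = 1000
K = 50                                   # each worker transmits only K of DIM entries

residual = np.zeros((NUM_WORKERS, DIM))  # error feedback: untransmitted mass is kept locally


def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries; everything else stays local."""
    sparse = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse[idx] = grad[idx]
    return sparse


def sparse_allreduce_step(local_grads):
    """Each worker compresses (gradient + residual), the sparse messages are averaged,
    and every worker applies the same averaged update (synchronous SGD)."""
    messages = []
    for w in range(NUM_WORKERS):
        corrected = local_grads[w] + residual[w]
        sparse = topk_compress(corrected, K)
        residual[w] = corrected - sparse      # remember what was not sent
        messages.append(sparse)
    return np.mean(messages, axis=0)          # what an AllReduce over sparse messages yields


# One simulated step with random "local gradients".
local_grads = rng.normal(size=(NUM_WORKERS, DIM))
update = sparse_allreduce_step(local_grads)
dense_update = local_grads.mean(axis=0)
print("nonzeros sent per worker:", K, "of", DIM)
print("cosine similarity to dense average:",
      float(update @ dense_update / (np.linalg.norm(update) * np.linalg.norm(dense_update))))
```

The printed cosine similarity gives a rough sense of how much of the dense averaged gradient is retained when each worker transmits only K of the DIM entries per step.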