Related papers: Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization

Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization

URL: http://arxiv.org/abs/2012.03747v1
Date: Thu, 3 Dec 2020 11:52:55 GMT
Title: Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization
Authors: Huiping Zhuang, Zhiping Lin, Kar-Ann Toh
Abstract summary: We propose an accumulated decoupled learning (ADL) which incorporates the gradient accumulation technique to mitigate the stale gradient effect. We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature. The ADL is shown to outperform several state-of-the-arts in the classification tasks, and is the fastest among the compared methods.
Score: 16.02377434191239
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Decoupled learning is a branch of model parallelism which parallelizes the training of a network by splitting it depth-wise into multiple modules. Techniques from decoupled learning usually lead to stale gradient effect because of their asynchronous implementation, thereby causing performance degradation. In this paper, we propose an accumulated decoupled learning (ADL) which incorporates the gradient accumulation technique to mitigate the stale gradient effect. We give both theoretical and empirical evidences regarding how the gradient staleness can be reduced. We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature. Empirical validation is provided by training deep convolutional neural networks to perform classification tasks on CIFAR-10 and ImageNet datasets. The ADL is shown to outperform several state-of-the-arts in the classification tasks, and is the fastest among the compared methods.

Related papers

Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized server (DFL) eliminates reliance on client-client architecture. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z)
Classifier-guided Gradient Modulation for Enhanced Multimodal Learning [50.7008456698935]
Gradient-Guided Modulation (CGGM) is a novel method to balance multimodal learning with gradients. We conduct extensive experiments on four multimodal datasets: UPMC-Food 101, CMU-MOSI, IEMOCAP and BraTS. CGGM outperforms all the baselines and other state-of-the-art methods consistently.
arXiv Detail & Related papers (2024-11-03T02:38:43Z)
Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training. Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems. PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features. In this paper, we propose to employ implicit gradient descent (ISGD) method to train PINNs for improving the stability of training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
Delving into Effective Gradient Matching for Dataset Condensation [13.75957901381024]
gradient matching method directly targets the training dynamics by matching the gradient when training on the original and synthetic datasets. We propose to match the multi-level gradients to involve both intra-class and inter-class gradient information. An overfitting-aware adaptive learning step strategy is also proposed to trim unnecessary optimization steps for algorithmic efficiency improvement.
arXiv Detail & Related papers (2022-07-30T21:31:10Z)
Adaptive Learning Rate and Momentum for Training Deep Neural Networks [0.0]
We develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework. Experiments in image classification datasets show that our method yields faster convergence than other local solvers.
arXiv Detail & Related papers (2021-06-22T05:06:56Z)
Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC) We show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear Classification [3.158346511479111]
We propose a class of STEs with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions. We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum.
arXiv Detail & Related papers (2020-11-23T07:50:09Z)
Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, citetzhang 2019gradient show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD. Experiments confirm the superiority of clipping-based methods in deep learning tasks.
arXiv Detail & Related papers (2020-10-05T14:36:59Z)
Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem. We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent. Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
Embedding Propagation: Smoother Manifold for Few-Shot Classification [131.81692677836202]
We propose to use embedding propagation as an unsupervised non-parametric regularizer for manifold smoothing in few-shot classification. We empirically show that embedding propagation yields a smoother embedding manifold. We show that embedding propagation consistently improves the accuracy of the models in multiple semi-supervised learning scenarios by up to 16% points.
arXiv Detail & Related papers (2020-03-09T13:51:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.