Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates
- URL: http://arxiv.org/abs/2103.02351v1
- Date: Wed, 3 Mar 2021 12:08:23 GMT
- Title: Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates
- Authors: Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi
- Abstract summary: It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
- Score: 67.19481956584465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been experimentally observed that the efficiency of distributed
training with stochastic gradient descent (SGD) depends decisively on the batch
size and -- in asynchronous implementations -- on the gradient staleness.
Notably, it has been observed that the speedup saturates beyond a certain
batch size and/or when the delays grow too large. We identify a data-dependent
parameter that explains the speedup saturation in both these settings. Our
comprehensive theoretical analysis, for strongly convex, convex and non-convex
settings, unifies and generalizes prior work directions that often focused on
only one of these two aspects. In particular, our approach allows us to derive
improved speedup results under frequently considered sparsity assumptions. Our
insights give rise to theoretically grounded guidelines on how the learning rates
can be adjusted in practice. We show that our results are tight and illustrate
key findings in numerical experiments.
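The abstract points to guidelines for adjusting the learning rate with the batch size and the gradient staleness, but does not spell out the rule here. Below is a minimal, hedged Python sketch of that kind of adjustment for asynchronous large-batch SGD; the linear batch-size scaling, the 1/(1 + delay) damping, and all parameter names are illustrative assumptions, not the schedule derived in the paper.

```python
# Illustrative sketch only: a batch-size- and staleness-aware learning-rate rule
# for asynchronous SGD. The specific scaling choices are assumptions made for
# illustration, not the paper's prescription.

def adjusted_learning_rate(base_lr, batch_size, base_batch_size, max_delay):
    """Scale the step size up with the batch size (linear-scaling heuristic)
    and damp it when gradients may arrive with large staleness."""
    lr = base_lr * (batch_size / base_batch_size)  # larger batches -> larger steps
    return lr / (1.0 + max_delay)                  # staler gradients -> smaller steps


def async_sgd_step(w, delayed_grad, lr):
    """One asynchronous SGD update using a gradient computed at an older iterate;
    staleness is accounted for only through the step size in this sketch."""
    return w - lr * delayed_grad


# Example: batch size 4x the reference, worst-case staleness of 3 updates.
lr = adjusted_learning_rate(base_lr=0.1, batch_size=512, base_batch_size=128, max_delay=3)
```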
Related papers
- A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We show when and where double descent appears, and that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z)
- Asymmetric Momentum: A Rethinking of Gradient Descent [4.1001738811512345]
We propose a simple enhancement of SGD, Loss-Controlled Asymmetric Momentum (LCAM).
By averaging the loss, we divide the training process into different loss phases and use a different momentum coefficient in each phase, as sketched below.
We experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset.
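The following is a minimal, hedged Python sketch of the loss-phase idea described above: a running average of the loss decides the current phase, and the momentum coefficient is switched between phases. The exponential averaging, the threshold, and both momentum values are placeholder assumptions, not the constants or phase rule used by LCAM.

```python
# Hedged sketch of "different momentum per loss phase"; not the authors' code.

def update_avg_loss(avg_loss, loss, beta=0.9):
    """Exponential running average of the training loss, used to decide the phase."""
    return loss if avg_loss is None else beta * avg_loss + (1 - beta) * loss


def momentum_for_phase(avg_loss, threshold=1.0, high_mom=0.95, low_mom=0.5):
    """Pick a momentum coefficient depending on which loss phase training is in."""
    return high_mom if avg_loss > threshold else low_mom


def sgd_momentum_step(w, velocity, grad, lr, momentum):
    """Standard heavy-ball update; only the momentum coefficient is phase-dependent."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```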
arXiv Detail & Related papers (2023-09-05T11:16:47Z)
- Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
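As a rough, hedged illustration of what a personalized propagation order can look like, the sketch below caps the number of linear propagation hops each node receives, using its degree as a stand-in for "topological information". The degree-based rule is an assumption made for illustration and is not the adaptive approach proposed in the paper.

```python
import numpy as np

def personalized_propagation(adj, features, max_hops=3):
    """adj: (n, n) normalized adjacency matrix; features: (n, d) node features.
    Each node gets its own hop budget; here high-degree nodes stop earlier
    (an illustrative heuristic only)."""
    degrees = (adj > 0).sum(axis=1)
    hops = np.clip(max_hops - np.log1p(degrees).astype(int), 1, max_hops)

    out = features.copy()
    prop = features.copy()
    for k in range(1, max_hops + 1):
        prop = adj @ prop                       # one linear propagation step
        mask = (hops >= k).astype(float)[:, None]
        out = out + mask * prop                 # only nodes whose budget allows hop k
    return out / (hops[:, None] + 1.0)          # average over the hops each node used
```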
arXiv Detail & Related papers (2022-11-01T14:38:18Z)
- Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity [24.428843425522107]
We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
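For readers unfamiliar with the setting, a commonly used two-layer diagonal parametrization and its gradient flow are written out below as background; this is standard notation from the diagonal-linear-network literature and not necessarily the exact formulation analysed in the paper.

```latex
% Background sketch: a common diagonal linear network and its gradient flow.
\[
  \beta_\theta = u \odot u - v \odot v, \qquad \theta = (u, v) \in \mathbb{R}^{2d},
\]
\[
  L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( \langle \beta_\theta, x_i \rangle - y_i \bigr)^2,
  \qquad
  \dot{\theta}_t = -\nabla L(\theta_t)
  \quad \text{(gradient flow: the continuous-time limit of gradient descent).}
\]
```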
arXiv Detail & Related papers (2021-06-17T14:16:04Z)
- Escaping Saddle Points Faster with Stochastic Momentum [9.485782209646445]
In deep networks, momentum appears to significantly improve convergence time.
We show that momentum improves deep training because it modifies SGD to escape saddle points faster.
We also show how to choose the ideal momentum parameter.
arXiv Detail & Related papers (2021-06-05T23:34:02Z)
- On the Generalization Benefit of Noise in Stochastic Gradient Descent [34.127525925676416]
It has long been argued that minibatch gradient descent can generalize better than large batch gradient descent in deep neural networks.
We show that small or moderately large batch sizes can substantially outperform very large batches on the test set.
arXiv Detail & Related papers (2020-06-26T16:18:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
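The summary does not describe the update itself; as a hedged illustration of what an extrapolation step can look like in this context, the sketch below uses a generic extra-gradient-style update (evaluate the gradient at a look-ahead point, then step from the original iterate). This is a generic scheme given for orientation, not the specific method proposed in the paper.

```python
# Hedged, generic extrapolation (extra-gradient style) step; not the paper's scheme.

def extrapolated_sgd_step(w, grad_fn, lr):
    """grad_fn(w) returns a stochastic gradient at w (assumed supplied by the caller)."""
    w_look_ahead = w - lr * grad_fn(w)      # extrapolation: tentative look-ahead point
    return w - lr * grad_fn(w_look_ahead)   # actual update uses the look-ahead gradient
```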
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with IPS-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
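To make the variance issue concrete, here is a hedged sketch of a plain IPS-weighted gradient estimate for counterfactual learning to rank: each clicked item contributes a term scaled by the inverse of its examination propensity, so small propensities directly inflate individual gradient terms. The simplified loss interface is an assumption for illustration; this is the baseline estimator being criticised, not the CounterSample algorithm.

```python
import numpy as np

def ips_weighted_gradient(w, docs, clicks, propensities, grad_of_doc_loss):
    """docs: (n, d) document features; clicks: (n,) 0/1 click indicators;
    propensities: (n,) examination probabilities in (0, 1];
    grad_of_doc_loss(w, x): per-document ranking-loss gradient (assumed given)."""
    grad = np.zeros_like(w)
    for x, c, p in zip(docs, clicks, propensities):
        if c:
            grad += grad_of_doc_loss(w, x) / p  # weight 1/p can be huge -> high variance
    return grad / len(docs)
```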
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.