Large Batch Training Does Not Need Warmup
- URL: http://arxiv.org/abs/2002.01576v1
- Date: Tue, 4 Feb 2020 23:03:12 GMT
- Title: Large Batch Training Does Not Need Warmup
- Authors: Zhouyuan Huo, Bin Gu, Heng Huang
- Abstract summary: Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
- Score: 111.07680619360528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep neural networks using a large batch size has shown promising
results and benefits many real-world applications. However, the optimizer
converges slowly in the early epochs, and there is a gap between large-batch deep
learning optimization heuristics and their theoretical underpinnings. In this paper,
we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm
for large-batch training. We also analyze the convergence rate of the proposed
method by introducing a new fine-grained analysis of gradient-based methods.
Based on our analysis, we bridge the gap and illustrate the theoretical
insights for three popular large-batch training techniques, including linear
learning rate scaling, gradual warmup, and layer-wise adaptive rate scaling.
Extensive experiments demonstrate that the proposed algorithm outperforms
the gradual warmup technique by a large margin and converges faster than the
state-of-the-art large-batch optimizer when training advanced deep neural
networks (ResNet, DenseNet, MobileNet) on the ImageNet dataset.
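As background for the three techniques named in the abstract, the following is a minimal NumPy sketch of how linear learning rate scaling, gradual warmup, and a LARS-style layer-wise trust ratio are commonly computed. It is an illustration of the general recipe under assumed hyperparameters (base_batch_size, warmup_steps, trust_coef), not the authors' CLARS implementation.

```python
import numpy as np

def scaled_lr_with_warmup(step, warmup_steps, base_lr, batch_size, base_batch_size=256):
    """Linear scaling rule plus gradual warmup (illustrative values): the target
    learning rate is base_lr * (batch_size / base_batch_size), reached linearly
    over the first `warmup_steps` iterations."""
    target_lr = base_lr * batch_size / base_batch_size
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

def layerwise_adaptive_lrs(global_lr, weights, grads, trust_coef=1e-3, eps=1e-8):
    """LARS-style layer-wise adaptive rate scaling: each layer's learning rate is
    rescaled by trust_coef * ||w_l|| / ||g_l||, so layers whose gradients are small
    relative to their weights take proportionally larger steps."""
    lrs = []
    for w, g in zip(weights, grads):
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        trust_ratio = trust_coef * w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        lrs.append(global_lr * trust_ratio)
    return lrs

# Example with two dummy "layers":
weights = [np.random.randn(128, 64), np.random.randn(64, 10)]
grads = [0.01 * np.random.randn(128, 64), 0.5 * np.random.randn(64, 10)]
lr = scaled_lr_with_warmup(step=100, warmup_steps=500, base_lr=0.1, batch_size=8192)
print(layerwise_adaptive_lrs(lr, weights, grads))
```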
Related papers
- Simmering: Sufficient is better than optimal for training neural networks [0.0]
We introduce simmering, a physics-based method that trains neural networks to generate weights and biases that are merely "good enough".
We show that simmering corrects neural networks that have been overfit by Adam, and that it avoids overfitting if deployed from the outset.
Our results question optimization as a paradigm for neural network training, and leverage information-geometric arguments to point to the existence of classes of sufficient training algorithms.
arXiv Detail & Related papers (2024-10-25T18:02:08Z) - Learning Rate Optimization for Deep Neural Networks Using Lipschitz Bandits [9.361762652324968]
A properly tuned learning rate leads to faster training and higher test accuracy.
We propose a Lipschitz bandit-driven approach for tuning the learning rate of neural networks.
arXiv Detail & Related papers (2024-09-15T16:21:55Z) - Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training [30.452060061499523]
We introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation.
Experiments demonstrate the effectiveness of the approximation technique in neural network training.
arXiv Detail & Related papers (2024-03-18T23:23:50Z) - The Cascaded Forward Algorithm for Neural Network Training [61.06444586991505]
We propose a new learning framework for neural networks, the Cascaded Forward (CaFo) algorithm, which, like the Forward-Forward (FF) algorithm, does not rely on backpropagation (BP).
Unlike FF, our framework directly outputs label distributions at each cascaded block and does not require generating additional negative samples.
In our framework each block can be trained independently, so it can be easily deployed into parallel acceleration systems.
arXiv Detail & Related papers (2023-03-17T02:01:11Z) - Hyper-Learning for Gradient-Based Batch Size Adaptation [2.944323057176686]
Scheduling the batch size to increase is an effective strategy to control noise when training deep neural networks; a minimal example of such a schedule is sketched after this list.
We introduce Arbiter, a new hyper-optimization algorithm that performs batch size adaptation for learnable schedules.
We demonstrate Arbiter's effectiveness in several illustrative experiments.
arXiv Detail & Related papers (2022-05-17T11:01:14Z) - Analytically Tractable Inference in Deep Neural Networks [0.0]
The Tractable Approximate Gaussian Inference (TAGI) algorithm was shown to be a viable and scalable alternative to backpropagation for shallow fully-connected neural networks.
We demonstrate that TAGI matches or exceeds the performance of backpropagation for training classic deep neural network architectures.
arXiv Detail & Related papers (2021-03-09T14:51:34Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization at large scale with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds while retaining comparable theoretical guarantees.
Experiments on several datasets demonstrate the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
However, the use of gradient-based training combined with the nonconvexity of the underlying optimization problem renders parameter learning susceptible to initialization.
We propose fusing neighboring layers of deeper networks that are trained with random initializations.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
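As referenced in the Hyper-Learning for Gradient-Based Batch Size Adaptation entry above, increasing the batch size during training is often implemented as a simple hand-written schedule. The sketch below doubles the batch size at fixed, assumed epoch milestones; it is a static baseline for illustration, not the learned adaptation that Arbiter performs.

```python
def batch_size_schedule(epoch, base_batch_size=128, milestones=(30, 60, 80), max_batch_size=4096):
    """Double the batch size at each milestone epoch (values are illustrative
    assumptions), capping at max_batch_size. Increasing the batch size late in
    training reduces gradient noise, playing a role similar to decaying the
    learning rate."""
    factor = 2 ** sum(epoch >= m for m in milestones)
    return min(base_batch_size * factor, max_batch_size)

# Example: epochs 0, 45, and 90 yield batch sizes 128, 256, and 1024 respectively.
```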
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.