Study on the Large Batch Size Training of Neural Networks Based on the
Second Order Gradient
- URL: http://arxiv.org/abs/2012.08795v1
- Date: Wed, 16 Dec 2020 08:43:15 GMT
- Title: Study on the Large Batch Size Training of Neural Networks Based on the
Second Order Gradient
- Authors: Fengli Gao and Huicai Zhong
- Abstract summary: Large batch size training in deep neural networks (DNNs) suffers from a well-known 'generalization gap' that markedly degrades generalization performance.
Here, we combine theory with experiments to explore how the basic structural properties of NNs, including the gradient, the parameter update step length, and the loss update step length, evolve under varying batch sizes.
- Score: 1.3794617022004712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large batch size training in deep neural networks (DNNs) suffers
from a well-known 'generalization gap' that markedly degrades generalization
performance. However, it remains unclear how varying the batch size affects
the structure of an NN. Here, we combine theory with experiments to explore
how the basic structural properties of NNs, including the gradient, the
parameter update step length, and the loss update step length, evolve under
varying batch sizes. We provide new guidance for improving generalization,
which is further verified by two designed methods: discarding small-loss
samples and scheduling the batch size. A curvature-based learning rate (CBLR)
algorithm is proposed to better fit the curvature variation across the layers
of an NN, a sensitive factor in large batch size training. As an approximation
of CBLR, the median-curvature LR (MCLR) algorithm is found to achieve
performance comparable to the Layer-wise Adaptive Rate Scaling (LARS)
algorithm. Our theoretical results and algorithms offer geometry-based
explanations of existing studies. Furthermore, we demonstrate that layer-wise
LR algorithms, such as LARS, can be regarded as special instances of CBLR.
Finally, we deduce a theoretical geometric picture of large batch size
training and show that all the network parameters tend to center on their
respective minima.
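The abstract does not spell out the CBLR or MCLR update rules, but the underlying idea, assigning each layer a learning rate from a curvature estimate and, in MCLR, normalising by the median curvature across layers, can be sketched roughly as follows. The curvature proxy, function names, and constants below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def curvature_proxy(grad, grad_prev, step_prev, eps=1e-12):
    """Illustrative per-layer curvature estimate: a secant approximation
    c ~ ||g_t - g_{t-1}|| / ||w_t - w_{t-1}|| along the last update.
    The curvature measure actually used for CBLR is defined in the paper,
    not in the abstract, so this proxy is only an assumption."""
    return np.linalg.norm(grad - grad_prev) / (np.linalg.norm(step_prev) + eps)

def mclr_layer_lrs(layer_curvatures, base_lr, eps=1e-12):
    """MCLR-style rule: scale each layer's LR by (median curvature) /
    (layer curvature), so sharp (high-curvature) layers take smaller steps
    and flat layers take larger ones."""
    c = np.asarray(layer_curvatures, dtype=float)
    return base_lr * np.median(c) / (c + eps)

if __name__ == "__main__":
    # Toy example: three layers with different curvature proxies.
    print(mclr_layer_lrs([0.5, 2.0, 8.0], base_lr=0.1))
    # -> approx. [0.4, 0.1, 0.025]: the flattest layer gets a 4x larger LR
    #    and the sharpest a 4x smaller one, relative to the median layer.
```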
Related papers
- Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for
Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules.
arXiv Detail & Related papers (2023-05-23T04:12:55Z)
- The Cascaded Forward Algorithm for Neural Network Training [61.06444586991505]
We propose a new learning framework for neural networks, the Cascaded Forward (CaFo) algorithm, which, like the Forward-Forward (FF) algorithm, does not rely on backpropagation (BP).
Unlike FF, our framework directly outputs label distributions at each cascaded block and does not require the generation of additional negative samples.
In our framework, each block can be trained independently, so it can be easily deployed to parallel acceleration systems.
arXiv Detail & Related papers (2023-03-17T02:01:11Z)
- Reparameterization through Spatial Gradient Scaling [69.27487006953852]
Reparameterization aims to improve the generalization of deep neural networks by transforming convolutional layers into equivalent multi-branched structures during training.
We present a novel spatial gradient scaling method to redistribute learning focus among weights in convolutional networks.
arXiv Detail & Related papers (2023-03-05T17:57:33Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling [48.94555574632823]
Repriorisation transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.
We develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN.
We observe up to 50x higher effective sample size relative to no reparametrisation for both fully-connected and residual networks.
arXiv Detail & Related papers (2022-06-15T17:11:08Z)
- A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, with convergence time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
- Semi-Implicit Back Propagation [1.5533842336139065]
We propose a semi-implicit back propagation method for neural network training.
The differences on the neurons are propagated in a backward fashion, and the parameters are updated with a proximal mapping.
Experiments on both MNIST and CIFAR-10 demonstrate that the proposed algorithm leads to better performance in terms of both loss decreasing and training/validation accuracy.
arXiv Detail & Related papers (2020-02-10T03:26:09Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques (a minimal sketch of the LARS trust ratio that this line of work builds on is given after this list).
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
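Since the abstract presents layer-wise LR methods such as LARS as special instances of CBLR, and the per-layer step-size and CLARS entries above build on the same idea, a minimal sketch of the standard LARS trust ratio (You et al., 2017) is given here for reference; the closing comment about its CBLR reading is an interpretation, not the authors' derivation.

```python
import numpy as np

def lars_local_lr(weights, grads, eta=0.001, weight_decay=1e-4, eps=1e-12):
    """Standard LARS trust ratio:
        local_lr = eta * ||w|| / (||g|| + weight_decay * ||w||).
    In the full algorithm this factor rescales the weight-decayed,
    momentum-filtered gradient of each layer; only the ratio is shown here."""
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grads)
    return eta * w_norm / (g_norm + weight_decay * w_norm + eps)

if __name__ == "__main__":
    w = np.random.randn(256, 128)
    g = 0.01 * np.random.randn(256, 128)
    # Viewed through CBLR, ||g|| / ||w|| plays the role of a per-layer
    # curvature proxy, so LARS is one particular curvature-based rule.
    print(lars_local_lr(w, g))
```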