Stacking as Accelerated Gradient Descent
- URL: http://arxiv.org/abs/2403.04978v1
- Date: Fri, 8 Mar 2024 01:23:25 GMT
- Title: Stacking as Accelerated Gradient Descent
- Authors: Naman Agarwal and Pranjal Awasthi and Satyen Kale and Eric Zhao
- Abstract summary: Stacking is a technique for training deep residual networks by progressively increasing the number of layers.
We propose a theoretical explanation for the efficacy of stacking.
We prove that for certain deep linear residual networks, stacking does provide accelerated training.
- Score: 44.17524017365296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stacking, a heuristic technique for training deep residual networks by
progressively increasing the number of layers and initializing new layers by
copying parameters from older layers, has proven quite successful in improving
the efficiency of training deep neural networks. In this paper, we propose a
theoretical explanation for the efficacy of stacking: viz., stacking implements
a form of Nesterov's accelerated gradient descent. The theory also covers
simpler models such as the additive ensembles constructed in boosting methods,
and provides an explanation for a similar widely-used practical heuristic for
initializing the new classifier in each round of boosting. We also prove that
for certain deep linear residual networks, stacking does provide accelerated
training, via a new potential function analysis of Nesterov's accelerated
gradient method that allows for errors in updates. We conduct proof-of-concept
experiments to validate our theory as well.
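To make the heuristic concrete, here is a minimal PyTorch sketch of stacking for a toy residual model; the block structure and the method name `grow_by_stacking` are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy residual block computing x + f(x)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)

class StackedResNet(nn.Module):
    """Residual network that is grown progressively (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.blocks = nn.ModuleList([ResidualBlock(dim)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

    def grow_by_stacking(self):
        # Stacking heuristic: the new top block is initialized as a copy of the
        # current top block rather than from a fresh random initialization.
        self.blocks.append(copy.deepcopy(self.blocks[-1]))

# Usage: alternate training phases with growth steps.
model = StackedResNet(dim=16)
# ... train the 1-block model for a while ...
model.grow_by_stacking()   # now 2 blocks; the new block copies the old top block
# ... continue training the deeper model ...
```

Roughly, the paper's theory interprets this copy-the-top-block initialization as the look-ahead (momentum) step of Nesterov's accelerated gradient method, viewed as an update in function space.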
Related papers
- Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks [15.691263438655842]
The Spiking Neural Network (SNN) is a biologically inspired neural network architecture that has recently garnered significant attention.
Training an SNN directly poses a challenge due to the undefined gradient of the firing spike process.
We propose a shortcut back-propagation method that transmits the gradient directly from the loss to the shallow layers.
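As a hedged illustration of that idea (not the paper's exact SNN construction), the sketch below gives every shallow layer a direct gradient path from the loss via auxiliary classifier heads; the module and its names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortcutBPNet(nn.Module):
    """Illustrative only: auxiliary heads give each shallow layer a direct
    gradient path from the loss, bypassing the deeper layers."""
    def __init__(self, dim, n_layers, n_classes):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.heads = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_layers))

    def forward(self, x, y):
        loss = 0.0
        for layer, head in zip(self.layers, self.heads):
            x = torch.relu(layer(x))
            # Each intermediate representation feeds its own head, so gradients
            # reach this layer from the loss without traversing deeper layers.
            loss = loss + F.cross_entropy(head(x), y)
        return loss / len(self.layers)
```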
arXiv Detail & Related papers (2024-01-09T10:54:41Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound improves as the pruning fraction increases.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent [95.94432031144716]
We propose a unified non-convex optimization framework for the analysis of neural network training.
We show that existing guarantees for networks trained by gradient descent can be unified through this framework.
arXiv Detail & Related papers (2021-06-25T17:45:00Z)
- Backward Gradient Normalization in Deep Neural Networks [68.8204255655161]
We introduce a new technique for gradient normalization during neural network training.
The gradients are rescaled during the backward pass using normalization layers introduced at certain points within the network architecture.
Tests on very deep neural networks show that the new technique effectively controls the gradient norm.
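A minimal sketch of this kind of mechanism, assuming a layer that is the identity in the forward pass and rescales the gradient to unit L2 norm in the backward pass; the placement and the exact rescaling rule are assumptions rather than the paper's recipe.

```python
import torch

class GradNorm(torch.autograd.Function):
    """Identity in the forward pass; rescales the incoming gradient to unit
    L2 norm in the backward pass (illustrative backward gradient normalization)."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output / (grad_output.norm() + 1e-12)

def grad_norm(x):
    # Parameter-free layer: insert between blocks where gradient control is desired,
    # e.g. h = grad_norm(h) at chosen points in the architecture.
    return GradNorm.apply(x)
```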
arXiv Detail & Related papers (2021-06-17T13:24:43Z)
- Initialization and Regularization of Factorized Neural Layers [23.875225732697142]
We show how to initialize and regularize factorized layers in deep nets.
We show how these schemes lead to improved performance on both translation and unsupervised pre-training.
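As an illustration of one scheme consistent with this summary, the sketch below builds a spectral-style initialization for a rank-r factorized linear layer from the SVD of a conventionally initialized dense matrix; the helper name `spectral_init` and the details are assumptions, not the paper's exact recipe.

```python
import torch

def spectral_init(dim_out, dim_in, rank):
    """Initialize factors (U, V) so that U @ V approximates a standard dense
    initialization of the full weight, keeping its top-`rank` singular directions."""
    W = torch.empty(dim_out, dim_in)
    torch.nn.init.kaiming_uniform_(W, a=5 ** 0.5)          # standard dense init
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    s = S[:rank].sqrt()
    return U[:, :rank] * s, s[:, None] * Vh[:rank]          # shapes: (out, r), (r, in)

# Usage: parameterize the layer as x @ V.T @ U.T instead of x @ W.T.
U, V = spectral_init(dim_out=256, dim_in=128, rank=32)
```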
arXiv Detail & Related papers (2021-05-03T17:28:07Z)
- Deep Networks from the Principle of Rate Reduction [32.87280757001462]
This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification.
We show that the basic iterative gradient ascent scheme for optimizing the rate reduction of learned features naturally leads to a multi-layer deep network, one iteration per layer.
All components of this "white box" network have precise optimization, statistical, and geometric interpretations.
arXiv Detail & Related papers (2020-10-27T06:01:43Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
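A minimal sketch of how such a Hessian-norm diagnostic might be tracked in practice, assuming a generic power-iteration estimator based on Hessian-vector products; this is not necessarily the paper's estimator.

```python
import torch

def hessian_spectral_norm(loss, params, iters=20):
    """Rough power-iteration estimate of the largest Hessian eigenvalue magnitude
    (spectral norm) using Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((t * t).sum() for t in v))
    v = [t / v_norm for t in v]                      # unit-norm starting vector
    est = torch.tensor(0.0)
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        est = torch.sqrt(sum((h * h).sum() for h in hv))   # ||H v|| with ||v|| = 1
        v = [h / (est + 1e-12) for h in hv]
    return est.item()

# Usage: loss = criterion(model(x), y)
#        hessian_spectral_norm(loss, list(model.parameters()))
```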
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.