Taylorized Training: Towards Better Approximation of Neural Network
Training at Finite Width
- URL: http://arxiv.org/abs/2002.04010v2
- Date: Mon, 24 Feb 2020 21:12:54 GMT
- Title: Taylorized Training: Towards Better Approximation of Neural Network
Training at Finite Width
- Authors: Yu Bai, Ben Krause, Huan Wang, Caiming Xiong, Richard Socher
- Abstract summary: Taylorized training involves training the $k$-th order Taylor expansion of the neural network.
We show that Taylorized training agrees increasingly well with full neural network training as we increase $k$.
We complement our experiments with theoretical results showing that the approximation error of $k$-th order Taylorized models decays exponentially in $k$ for wide neural networks.
- Score: 116.69845849754186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose \emph{Taylorized training} as an initiative towards better
understanding neural network training at finite width. Taylorized training
involves training the $k$-th order Taylor expansion of the neural network at
initialization, and is a principled extension of linearized training---a
recently proposed theory for understanding the success of deep learning.
We experiment with Taylorized training on modern neural network
architectures, and show that Taylorized training (1) agrees increasingly well
with full neural network training as we increase $k$, and (2) can
significantly close the performance gap between linearized and full training.
Compared with linearized training, higher-order training works in more
realistic settings such as standard parameterization and large (initial)
learning rate. We complement our experiments with theoretical results showing
that the approximation error of $k$-th order Taylorized models decays
exponentially in $k$ for wide neural networks.
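To make the object concrete, below is a minimal sketch of $k$-th order Taylorized training in JAX. It is not the authors' implementation: `apply_fn`, `tree_sub`, and `taylor_expand` are illustrative names, the nested `jax.jvp` construction is just one way to build the higher-order terms, and the training loop is reduced to a single gradient computation.

```python
# Minimal sketch of k-th order Taylorized training (illustrative, not the
# authors' code). Assumes apply_fn(params, x) is any JAX-differentiable network.
import math
import jax
import jax.numpy as jnp

def tree_sub(a, b):
    """Elementwise difference a - b of two parameter pytrees."""
    return jax.tree_util.tree_map(lambda x, y: x - y, a, b)

def taylor_expand(apply_fn, params0, k):
    """Return f_k(params, x), the order-k Taylor expansion of apply_fn in the
    parameters around params0:
        f_k(params, x) = sum_{j=0}^k (1/j!) D^j f(params0, x)[params - params0]^j
    k = 1 recovers linearized (NTK-style) training."""
    def f_k(params, x):
        v = tree_sub(params, params0)          # fixed expansion direction
        g = lambda p: apply_fn(p, x)           # close over the input batch
        out = g(params0)                       # 0th-order term
        for j in range(1, k + 1):
            prev = g
            # one more directional derivative along v, via forward-mode jvp
            g = lambda p, prev=prev: jax.jvp(prev, (p,), (v,))[1]
            out = out + g(params0) / math.factorial(j)
        return out
    return f_k

# Usage sketch: train f_k exactly like the full network; only the model differs.
# params0 = init(rng, ...)                       # hypothetical initializer
# f_k = taylor_expand(apply_fn, params0, k=2)
# loss = lambda p, x, y: jnp.mean((f_k(p, x) - y) ** 2)
# grads = jax.grad(loss)(params, x_batch, y_batch)
```

In this form, the abstract's two observations read as statements about $f_k$: its training trajectory tracks that of the full network more closely as $k$ grows, and its final accuracy closes much of the gap left by the $k = 1$ (linearized) model.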
Related papers
- Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
- One Forward is Enough for Neural Network Training via Likelihood Ratio Method [47.013384887197454]
Backpropagation (BP) is the mainstream approach for gradient computation in neural network training.
We develop a unified likelihood ratio (ULR) method for gradient estimation with just one forward propagation.
arXiv Detail & Related papers (2023-05-15T19:02:46Z)
- Maximal Initial Learning Rates in Deep ReLU Networks [32.157430904535126]
We introduce the maximal initial learning rate $\eta^{\ast}$.
We observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ behaves differently from the maximum learning rate later in training.
arXiv Detail & Related papers (2022-12-14T15:58:37Z)
- Neural Capacitance: A New Perspective of Neural Network Selection via Edge Dynamics [85.31710759801705]
Current practice incurs expensive computational costs, because models must be trained before their performance can be predicted.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z)
- Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via Polyak-Lojasiewicz (PL), smoothness, and standard assumptions (the PL and smoothness conditions are restated after this list).
arXiv Detail & Related papers (2021-11-02T20:24:01Z)
- Edge of chaos as a guiding principle for modern neural network training [19.419382003562976]
We study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram.
In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset.
arXiv Detail & Related papers (2021-07-20T12:17:55Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination [11.173092834726528]
It is desirable to directly train a compact neural network from scratch with low memory and low computational cost.
Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks (a generic low-rank sketch follows this list).
This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks.
arXiv Detail & Related papers (2020-10-17T01:23:26Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
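For the Subquadratic Overparameterization entry above, the conditions it appeals to are standard in optimization. The following is a generic statement (our wording, not necessarily the paper's exact assumptions) for a loss $L$ with minimum value $L^{\ast}$ and constants $\mu, \beta > 0$:

```latex
% Generic Polyak-Lojasiewicz (PL) and beta-smoothness conditions.
\[
  \tfrac{1}{2}\,\lVert \nabla L(w) \rVert^{2} \;\ge\; \mu\,\bigl(L(w) - L^{\ast}\bigr)
  \quad\text{(PL)},
  \qquad
  \lVert \nabla L(w) - \nabla L(w') \rVert \;\le\; \beta\,\lVert w - w' \rVert
  \quad\text{(smoothness)}.
\]
```

Together these imply linear convergence of gradient descent with step size $1/\beta$, which is the kind of guarantee such overparameterization analyses target.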
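For the Towards Compact Neural Networks entry, here is a generic illustration of why low-rank factorization saves memory and compute; it is not the paper's Bayesian tensorized method, and the names `init_low_rank` and `low_rank_linear` are ours. Replacing a dense $m \times n$ weight by rank-$r$ factors cuts the parameter count from $mn$ to $r(m+n)$.

```python
# Generic low-rank linear layer (illustrative only; not the paper's Bayesian
# tensorized method, and the names init_low_rank / low_rank_linear are ours).
import jax
import jax.numpy as jnp

def init_low_rank(rng, m, n, r):
    """Rank-r factors U (m x r) and V (n x r) standing in for a dense m x n weight."""
    ku, kv = jax.random.split(rng)
    return {"U": jax.random.normal(ku, (m, r)) / jnp.sqrt(r),   # simple fan-in scaling
            "V": jax.random.normal(kv, (n, r)) / jnp.sqrt(n)}

def low_rank_linear(params, x):
    """y = x @ (U V^T)^T, computed as two thin matmuls without ever forming U V^T."""
    return (x @ params["V"]) @ params["U"].T

# Parameter count: m*n dense vs. r*(m+n) factorized.
# For m = n = 1024, r = 32: ~1.05M vs. ~65.5K parameters.
params = init_low_rank(jax.random.PRNGKey(0), m=1024, n=1024, r=32)
y = low_rank_linear(params, jnp.ones((8, 1024)))   # shape (8, 1024)
```

Per its summary, the paper's end-to-end approach trains such factored (tensorized) parameters from scratch and determines the ranks automatically, rather than factorizing a pretrained dense network.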