Training Multi-Layer Over-Parametrized Neural Network in Subquadratic
Time
- URL: http://arxiv.org/abs/2112.07628v2
- Date: Fri, 24 Nov 2023 00:38:52 GMT
- Title: Training Multi-Layer Over-Parametrized Neural Network in Subquadratic
Time
- Authors: Zhao Song, Lichen Zhang, Ruizhe Zhang
- Abstract summary: We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function.
In this work, we show how to reduce the training cost per iteration.
- Score: 12.348083977777833
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We consider the problem of training a multi-layer over-parametrized neural
network to minimize the empirical risk induced by a loss function. In the
typical setting of over-parametrization, the network width $m$ is much larger
than the data dimension $d$ and the number of training samples $n$
($m=\mathrm{poly}(n,d)$), which induces a prohibitively large weight matrix $W\in
\mathbb{R}^{m\times m}$ per layer. Naively, one has to pay $O(m^2)$ time to
read the weight matrix and evaluate the neural network function in both forward
and backward computation. In this work, we show how to reduce the training cost
per iteration. Specifically, we propose a framework that uses $m^2$ cost only
in the initialization phase and achieves \emph{a truly subquadratic cost per
iteration} in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. Our result
has implications beyond standard over-parametrization theory, as it can be
viewed as designing an efficient data structure on top of a pre-trained large
model to further speed up the fine-tuning process, a core procedure to deploy
large language models (LLMs).
Related papers
- Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m\geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z)
- A Unified Scheme of ResNet and Softmax [8.556540804058203]
We provide a theoretical analysis of the regression problem: $\| \langle \exp(Ax) + Ax, \mathbf{1}_n \rangle^{-1} ( \exp(Ax) + Ax ) \dots$
This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before.
arXiv Detail & Related papers (2023-09-23T21:41:01Z)
- A Sublinear Adversarial Training Algorithm [13.42699247306472]
We analyze the convergence guarantee of the adversarial training procedure on a two-layer neural network with shifted ReLU activation.
We develop an algorithm for adversarial training with time cost $o(mnd)$ per iteration by applying a half-space reporting data structure.
arXiv Detail & Related papers (2022-08-10T15:31:40Z)
- Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new and a subset of binary neural networks, as a small subset of search trees, where each corresponds to a subset of search trees (Ds)
We believe this view would have further applications in the analysis of deep networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Does Preprocessing Help Training Over-parameterized Neural Networks? [19.64638346701198]
We propose two novel preprocessing ideas to bypass the $\Omega(mnd)$ barrier.
Our results provide theoretical insights for a large number of previously established fast training methods.
arXiv Detail & Related papers (2021-10-09T18:16:23Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient loss at most ground with Tangent samples.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)
- Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)