Training (Overparametrized) Neural Networks in Near-Linear Time
- URL: http://arxiv.org/abs/2006.11648v2
- Date: Wed, 9 Dec 2020 02:02:10 GMT
- Title: Training (Overparametrized) Neural Networks in Near-Linear Time
- Authors: Jan van den Brand, Binghui Peng, Zhao Song, Omri Weinstein
- Abstract summary: We show how to speed up the algorithm of [CGH+19] for training (mildly overparametrized) ReLU networks.
The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem.
- Score: 21.616949485102342
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The slow convergence rate and pathological curvature issues of first-order
gradient methods for training deep neural networks initiated an ongoing effort
for developing faster $\mathit{second}$-$\mathit{order}$ optimization
algorithms beyond SGD, without compromising the generalization error. Despite
their remarkable convergence rate ($\mathit{independent}$ of the training batch
size $n$), second-order algorithms incur a daunting slowdown in the
$\mathit{cost}$ $\mathit{per}$ $\mathit{iteration}$ (inverting the Hessian
matrix of the loss function), which renders them impractical. Very recently,
this computational overhead was mitigated by the works of [ZMG19, CGH+19],
yielding an $O(mn^2)$-time second-order algorithm for training two-layer
overparametrized neural networks of polynomial width $m$.
We show how to speed up the algorithm of [CGH+19], achieving an
$\tilde{O}(mn)$-time backpropagation algorithm for training (mildly
overparametrized) ReLU networks, which is near-linear in the dimension ($mn$)
of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to
reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and
then use a Fast-JL type dimension reduction to $\mathit{precondition}$ the
underlying Gram matrix in time independent of $m$, allowing us to find a
sufficiently good approximate solution via $\mathit{first}$-$\mathit{order}$
conjugate gradient. Our result provides a proof-of-concept that advanced
machinery from randomized linear algebra -- which led to recent breakthroughs
in $\mathit{convex}$ $\mathit{optimization}$ (ERM, LPs, Regression) -- can be
carried over to the realm of deep learning as well.
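To make the pipeline above concrete, the following is a minimal sketch, assuming NumPy/SciPy and a synthetic Jacobian (it is not the authors' implementation): the Gauss-Newton step is posed as the linear system $(JJ^\top)x = b$, a random sketch of the columns of $J$ supplies a triangular preconditioner, and first-order conjugate gradient solves the preconditioned system. A plain Gaussian sketch stands in here for the Fast-JL transform, and all problem sizes are placeholder values.

```python
# Minimal sketch-and-precondition illustration (assumed sizes, Gaussian sketch
# in place of Fast-JL; not the authors' code).
import numpy as np
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n, m, s = 200, 4000, 600            # n data points, width m >> n, sketch size s

J = rng.standard_normal((n, m))     # stand-in for the n x m Jacobian
b = rng.standard_normal(n)          # right-hand side of the Gauss-Newton step

# Sketch the columns of J: JS (n x s) approximately preserves the Gram matrix
# J J^T, so the R factor from a QR of JS^T gives a cheap preconditioner.
S = rng.standard_normal((m, s)) / np.sqrt(s)
JS = J @ S
_, R = qr(JS.T, mode='economic')    # JS JS^T = R^T R, with R of size n x n

def gram_matvec(x):                 # multiply by the Gram matrix J J^T
    return J @ (J.T @ x)

def precond_matvec(x):              # apply (R^T R)^{-1} via two triangular solves
    y = solve_triangular(R, x, trans='T')
    return solve_triangular(R, y)

A_op = LinearOperator((n, n), matvec=gram_matvec)
M_op = LinearOperator((n, n), matvec=precond_matvec)

x, info = cg(A_op, b, M=M_op, maxiter=50)
print(info, np.linalg.norm(gram_matvec(x) - b) / np.linalg.norm(b))
```

Note that with a dense Gaussian sketch the sketching step itself is not cheap; the role of the Fast-JL transform in the paper is precisely to make this preconditioning step run in time near-linear in the size of $J$.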
Related papers
- Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks [46.04785603483612]
We prove that Nesterov's accelerated gradient attains a complexity of $O(\kappa \log \frac{1}{\epsilon})$.
In particular, we prove that NAG can also attain an accelerated linear convergence rate.
arXiv Detail & Related papers (2024-10-12T20:33:37Z) - Faster Linear Systems and Matrix Norm Approximation via Multi-level Sketched Preconditioning [10.690769339903941]
We present a new class of preconditioned iterative methods for solving linear systems of the form $Ax = b$.
Our methods are based on constructing a low-rank Nyström approximation to $A$ using sparse random sketching.
We prove that the convergence of our methods depends on a natural average condition number of $A$, which improves as the rank of the Nyström approximation increases; a minimal sketch of this preconditioning idea appears after this list.
arXiv Detail & Related papers (2024-05-09T15:53:43Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Randomized Block-Coordinate Optimistic Gradient Algorithms for
Root-Finding Problems [8.0153031008486]
We develop two new algorithms to approximate a solution of nonlinear equations in large-scale settings.
We apply our methods to a class of large-scale finite-sum inclusions, which covers prominent applications in machine learning.
arXiv Detail & Related papers (2023-01-08T21:46:27Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study the problem of learning a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma$.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{opt} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
We give a sketching-based iterative algorithm that computes $1+\varepsilon$ approximate solutions for the ridge regression problem.
We also show that this algorithm can be used to give faster algorithms for kernel ridge regression.
arXiv Detail & Related papers (2022-04-13T22:18:47Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network trained with gradient descent can provably learn the ground-truth network beyond what Neural Tangent Kernel analyses can explain.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)
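As referenced in the multi-level sketched preconditioning entry above, here is a hedged sketch of the generic randomized Nyström preconditioner for conjugate gradient (a Gaussian test matrix and a synthetic PSD matrix are assumptions here, not that paper's sparse-sketching construction): a rank-$r$ Nyström approximation $A \approx U\,\mathrm{diag}(\lambda)\,U^\top$ is built from $A\Omega$, and the exact inverse of $A_{\mathrm{nys}} + \mu I$ preconditions CG on the regularized system $(A + \mu I)x = b$.

```python
# Hedged sketch of randomized Nyström preconditioning for CG
# (assumed test problem and parameters; not the paper's exact method).
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(2)
n, r, mu = 500, 40, 1e-2

# Synthetic PSD matrix with fast spectral decay (assumed test problem).
G = rng.standard_normal((n, n))
Q, _ = np.linalg.qr(G)
A = (Q * np.exp(-0.2 * np.arange(n))) @ Q.T
b = rng.standard_normal(n)

# Randomized Nyström approximation: A_nys = (A Omega) (Omega^T A Omega)^{-1} (A Omega)^T.
Omega = rng.standard_normal((n, r))
Y = A @ Omega                                  # n x r
W = Omega.T @ Y                                # r x r core matrix Omega^T A Omega
L = np.linalg.cholesky(W + 1e-10 * np.trace(W) * np.eye(r))  # tiny shift for safety
B = np.linalg.solve(L, Y.T).T                  # B = Y L^{-T}, so A_nys = B B^T
U, sigma, _ = np.linalg.svd(B, full_matrices=False)
lam = sigma**2                                 # eigenvalues of A_nys

def system_matvec(x):                          # (A + mu I) x
    return A @ x + mu * x

def precond_matvec(x):                         # exact inverse of (A_nys + mu I)
    Ux = U.T @ x
    return U @ (Ux / (lam + mu)) + (x - U @ Ux) / mu

x, info = cg(LinearOperator((n, n), matvec=system_matvec), b,
             M=LinearOperator((n, n), matvec=precond_matvec), maxiter=100)
print(info, np.linalg.norm(system_matvec(x) - b) / np.linalg.norm(b))
```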