Beyond Lazy Training for Over-parameterized Tensor Decomposition
- URL: http://arxiv.org/abs/2010.11356v1
- Date: Thu, 22 Oct 2020 00:32:12 GMT
- Title: Beyond Lazy Training for Over-parameterized Tensor Decomposition
- Authors: Xiang Wang, Chenwei Wu, Jason D. Lee, Tengyu Ma, Rong Ge
- Abstract summary: We show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
Our results show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
- Score: 69.4699995828506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over-parametrization is an important technique in training neural networks.
In both theory and practice, training a larger network allows the optimization
algorithm to avoid bad local optimal solutions. In this paper we study a
closely related tensor decomposition problem: given an $l$-th order tensor in
$(R^d)^{\otimes l}$ of rank $r$ (where $r\ll d$), can variants of gradient
descent find a rank $m$ decomposition where $m > r$? We show that in a lazy
training regime (similar to the NTK regime for neural networks) one needs at
least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an
approximate tensor when $m = O^*(r^{2.5l}\log d)$. Our results show that
gradient descent on over-parametrized objective could go beyond the lazy
training regime and utilize certain low-rank structure in the data.
Related papers
- Overcomplete Tensor Decomposition via Koszul-Young Flattenings [63.01248796170617]
We give a new algorithm for decomposing an $n_times n times n_3$ tensor as the sum of a minimal number of rank-1 terms.
We show that an even more general class of degree-$d$s cannot surpass rank $Cn$ for a constant $C = C(d)$.
arXiv Detail & Related papers (2024-11-21T17:41:09Z) - Adapting Newton's Method to Neural Networks through a Summary of
Higher-Order Derivatives [0.0]
We consider a gradient-based optimization method applied to a function $boldsymboltheta$.
This framework encompasses many common use-cases, such as training neural networks by gradient descent.
arXiv Detail & Related papers (2023-12-06T20:24:05Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural-networks trained by gradient in the interpolating regime.
We prove the training loss number minimizations $m=Omega(log4 (n))$ neurons and neurons $Tapprox n$.
With $m=Omega(log4 (n))$ neurons and $Tapprox n$, we bound the test loss training by $tildeO (1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Average-Case Complexity of Tensor Decomposition for Low-Degree
Polynomials [93.59919600451487]
"Statistical-computational gaps" occur in many statistical inference tasks.
We consider a model for random order-3 decomposition where one component is slightly larger in norm than the rest.
We show that tensor entries can accurately estimate the largest component when $ll n3/2$ but fail to do so when $rgg n3/2$.
arXiv Detail & Related papers (2022-11-10T00:40:37Z) - Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new and a subset of binary neural networks, as a small subset of search trees, where each corresponds to a subset of search trees (Ds)
We believe this view would have further applications in analysis analysis of deep networks (Ds)
arXiv Detail & Related papers (2022-08-09T02:29:42Z) - Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor
Decompositions [51.19236668224547]
We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions.
For tensor train decomposition, we give a bicriteria $(1 + eps)$-approximation algorithm with a small bicriteria rank and $O(q cdot nnz(A))$ running time.
In addition, we extend our algorithm to tensor networks with arbitrary graphs.
arXiv Detail & Related papers (2022-07-15T11:55:09Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $boldsymbolW$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamic of descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient loss at most ground with Tangent samples.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Training (Overparametrized) Neural Networks in Near-Linear Time [21.616949485102342]
We show how to speed up the algorithm of [CGH+1] for training (mildly overetrized) ReparamLU networks.
The centerpiece of our algorithm is to reformulate the Gauss-Newton as an $ell$-recondition.
arXiv Detail & Related papers (2020-06-20T20:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.