High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation
- URL: http://arxiv.org/abs/2205.01445v1
- Date: Tue, 3 May 2022 12:09:59 GMT
- Title: High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation
- Authors: Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg
Yang
- Abstract summary: We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
- Score: 89.21686761957383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the first gradient descent step on the first-layer parameters
$\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) =
\frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$,
where $\boldsymbol{W}\in\mathbb{R}^{d\times N},
\boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training
objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n
(f(\boldsymbol{x}_i)-y_i)^2$. In the proportional asymptotic limit where
$n,d,N\to\infty$ at the same rate, and an idealized student-teacher setting, we
show that the first gradient update contains a rank-1 "spike", which results in
an alignment between the first-layer weights and the linear component of the
teacher model $f^*$. To characterize the impact of this alignment, we compute
the prediction risk of ridge regression on the conjugate kernel after one
gradient step on $\boldsymbol{W}$ with learning rate $\eta$, when $f^*$ is a
single-index model. We consider two scalings of the first step learning rate
$\eta$. For small $\eta$, we establish a Gaussian equivalence property for the
trained feature map, and prove that the learned kernel improves upon the
initial random features model, but cannot defeat the best linear model on the
input. Whereas for sufficiently large $\eta$, we prove that for certain $f^*$,
the same ridge estimator on trained features can go beyond this "linear regime"
and outperform a wide range of random features and rotationally invariant
kernels. Our results demonstrate that even one gradient step can lead to a
considerable advantage over random features, and highlight the role of learning
rate scaling in the initial phase of training.
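As a rough, self-contained illustration of the setting in the abstract, the NumPy sketch below (not the authors' code) initializes $\boldsymbol{W}$ and $\boldsymbol{a}$ at random, takes one full-batch gradient step on $\boldsymbol{W}$ under the empirical MSE loss, and then compares ridge regression on the conjugate-kernel features $\sigma(\boldsymbol{W}^\top\boldsymbol{x})$ before and after that step. The teacher nonlinearity, the tanh activation, the ridge penalty, and the $\eta\sqrt{N}$ step scaling are illustrative assumptions rather than the paper's exact choices.

```python
# Minimal sketch of: random two-layer net f(x) = (1/sqrt(N)) a^T sigma(W^T x),
# one gradient step on W under empirical MSE, then ridge regression on the
# conjugate-kernel features. Hyperparameters and the teacher are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 2000, 400, 400          # proportional regime: n, d, N of the same order
eta, lam = 1.0, 1e-2              # first-step learning rate and ridge penalty (illustrative)

# Single-index teacher f*(x) = phi(<x, beta>) with an assumed nonlinearity
beta = rng.normal(size=d) / np.sqrt(d)
phi = lambda z: z + 0.5 * (z ** 2 - 1)
X = rng.normal(size=(n, d))
y = phi(X @ beta)

# Random initialization of the student
W = rng.normal(size=(d, N)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=N)
sigma = np.tanh

def features(X, W):
    # n x N conjugate-kernel feature matrix sigma(X W)
    return sigma(X @ W)

# One full-batch gradient step on W for the empirical MSE loss (a kept fixed)
Phi0 = features(X, W)
pred = Phi0 @ a / np.sqrt(N)
resid = pred - y
# dL/dW = (2/n) * X^T [ (resid a^T / sqrt(N)) * sigma'(X W) ],  sigma'(z) = 1 - tanh(z)^2
grad_W = (2 / n) * X.T @ ((resid[:, None] * a[None, :] / np.sqrt(N)) * (1.0 - Phi0 ** 2))
W1 = W - eta * np.sqrt(N) * grad_W   # "large" eta*sqrt(N) step scaling (assumed convention)

# Ridge regression on the features, evaluated on fresh test data
def ridge_fit_predict(Phi_tr, y_tr, Phi_te, lam):
    A = Phi_tr.T @ Phi_tr + lam * n * np.eye(Phi_tr.shape[1])
    w = np.linalg.solve(A, Phi_tr.T @ y_tr)
    return Phi_te @ w

X_te = rng.normal(size=(n, d))
y_te = phi(X_te @ beta)
for name, Wk in [("random features", W), ("after one step", W1)]:
    err = np.mean((ridge_fit_predict(features(X, Wk), y, features(X_te, Wk), lam) - y_te) ** 2)
    print(f"{name}: test MSE = {err:.3f}")
```

With such settings one would expect the features after the step to better track the linear component of the teacher and to lower the ridge test error relative to the random-features baseline, consistent with the abstract's claim; the exact numbers depend on the assumed teacher and scalings.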
Related papers
- Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis [45.05072391903122]
The information exponent plays an important role in predicting the sample complexity of online gradient descent.
For multi-index models, focusing solely on the lowest degree can miss key structural details.
We show that by considering both second- and higher-order terms, we can first learn the relevant space via the second-order terms.
arXiv Detail & Related papers (2024-10-13T00:14:08Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Solving Quadratic Systems with Full-Rank Matrices Using Sparse or Generative Priors [33.0212223058894]
The problem of recovering a signal from a quadratic system $y_i=\boldsymbol{x}^\top\boldsymbol{A}_i\boldsymbol{x},\ i=1,\ldots,m$ with full-rank matrices $\boldsymbol{A}_i$ frequently arises in applications such as unassigned distance geometry and sub-wavelength imaging.
This paper addresses the high-dimensional case where $m\ll n$ by incorporating prior knowledge of $\boldsymbol{x}$.
arXiv Detail & Related papers (2023-09-16T16:00:07Z) - Depth Dependence of $\mu$P Learning Rates in ReLU MLPs [72.14317069090407]
We study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate.
We find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
arXiv Detail & Related papers (2023-05-13T01:10:49Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations of gradient descent, we prove convergence of the training loss and bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Neural Networks Efficiently Learn Low-Dimensional Representations with
SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction.
We also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, efficient algorithm, LSVI-UCB$+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn the ground truth with gradient descent, beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.