Geometric compression of invariant manifolds in neural nets
- URL: http://arxiv.org/abs/2007.11471v4
- Date: Thu, 11 Mar 2021 08:58:04 GMT
- Title: Geometric compression of invariant manifolds in neural nets
- Authors: Jonas Paccolat, Leonardo Petrini, Mario Geiger, Kevin Tyloo and
Matthieu Wyart
- Abstract summary: We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions.
We show that for a one-hidden layer FC network trained with gradient descent, the first layer of weights evolves to become nearly insensitive to the $d_\perp = d - d_\parallel$ uninformative directions.
Next we show that compression shapes the Neural Tangent Kernel (NTK) evolution in time, so that its top eigenvectors become more informative and display a larger projection on the labels.
- Score: 2.461575510055098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how neural networks compress uninformative input space in models
where data lie in $d$ dimensions, but whose labels vary only within a linear
manifold of dimension $d_\parallel < d$. We show that for a one-hidden layer
network initialized with infinitesimal weights (i.e. in the feature learning
regime) trained with gradient descent, the first layer of weights evolves to
become nearly insensitive to the $d_\perp=d-d_\parallel$ uninformative
directions. These are effectively compressed by a factor $\lambda\sim
\sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit
of such a compression on the test error $\epsilon$. For large initialization of
the weights (the lazy training regime), no compression occurs and for regular
boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with
$\beta_\text{Lazy} = d / (3d-2)$. Compression improves the learning curves so
that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and
$\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test
these predictions for a stripe model where boundaries are parallel interfaces
($d_\parallel=1$) as well as for a cylindrical boundary ($d_\parallel=2$). Next
we show that compression shapes the Neural Tangent Kernel (NTK) evolution in
time, so that its top eigenvectors become more informative and display a larger
projection on the labels. Consequently, kernel learning with the frozen NTK at
the end of training outperforms the initial NTK. We confirm these predictions
both for a one-hidden layer FC network trained on the stripe model and for a
16-layer CNN trained on MNIST, for which we also find
$\beta_\text{Feature}>\beta_\text{Lazy}$.
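To make the setup concrete, here is a minimal sketch (not code from the paper; the task, network width, loss, and hyperparameters are illustrative assumptions) of a stripe-like model with $d_\parallel = 1$: a one-hidden-layer ReLU network is trained by gradient descent from small initialization (the feature learning regime), and the anisotropy of the first-layer weights is then measured to see how strongly the $d_\perp = d - 1$ uninformative directions have been compressed.
```python
# Illustrative sketch only: task, width, loss and learning rate are assumptions.
import torch

torch.manual_seed(0)
d, p = 10, 1024                       # ambient dimension, training-set size
X = torch.randn(p, d)                 # Gaussian inputs
y = torch.sign(X[:, 0])               # label depends on x_1 only: d_parallel = 1

def train_first_layer(init_scale, width=256, steps=3000, lr=0.1):
    """Gradient descent on a one-hidden-layer ReLU net with hinge loss."""
    W = (init_scale / d ** 0.5) * torch.randn(width, d)
    a = (init_scale / width ** 0.5) * torch.randn(width)
    W.requires_grad_(True)
    a.requires_grad_(True)
    for _ in range(steps):
        out = torch.relu(X @ W.T) @ a
        loss = torch.clamp(1.0 - y * out, min=0.0).mean()   # hinge loss
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad
            a -= lr * a.grad
            W.grad.zero_()
            a.grad.zero_()
    return W.detach()

# Anisotropy of the trained first layer: weight norm along the informative
# direction e_1 versus the average norm per uninformative direction. Starting
# from small weights (feature regime) this ratio is expected to grow well above
# 1, i.e. the d_perp uninformative directions are compressed.
W = train_first_layer(init_scale=1e-3)
ratio = W[:, 0].norm() / (W[:, 1:].norm() / (d - 1) ** 0.5)
print(f"compression ratio |w_parallel| / |w_perp per direction| ~ {ratio:.1f}")
```
Repeating the run with a much larger `init_scale` (the lazy regime) should leave this ratio close to 1, since the first layer barely moves, and sweeping $p$ should make it grow roughly like $\sqrt{p}$ as stated in the abstract; one could likewise extract the empirical NTK before and after training to compare kernel regression with the initial versus the frozen end-of-training kernel.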
Related papers
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Neural Networks Efficiently Learn Low-Dimensional Representations with
SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y = f(\langle \boldsymbol{u}, \boldsymbol{x} \rangle) + \epsilon$ by recovering the principal direction.
We also provide compression guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - An Exponential Improvement on the Memorization Capacity of Deep
Threshold Networks [40.489350374378645]
We prove that $\widetilde{\mathcal{O}}(e^{1/\delta^2} + \sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta} + n)$ weights are sufficient.
We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
arXiv Detail & Related papers (2021-06-14T19:42:32Z) - On the emergence of tetrahedral symmetry in the final and penultimate
layers of neural network classifiers [9.975163460952045]
We show that even the final output of the classifier $h$ is not uniform over data samples from a class $C_i$ if $h$ is a shallow network.
We explain this observation analytically in toy models for highly expressive deep neural networks.
arXiv Detail & Related papers (2020-12-10T02:32:52Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Deep Learning Meets Projective Clustering [66.726500395069]
A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A \in \mathbb{R}^{n \times d}$.
Inspired by projective clustering from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces.
arXiv Detail & Related papers (2020-10-08T22:47:48Z) - How isotropic kernels perform on simple invariants [0.5729426778193397]
We investigate how the training curve of isotropic kernel methods depends on the symmetry of the task to be learned.
We show that for large bandwidth, $\beta = \frac{d-1+\xi}{3d-3+\xi}$, where $\xi \in (0,2)$ is the exponent characterizing the singularity of the kernel at the origin.
arXiv Detail & Related papers (2020-06-17T09:59:18Z)