Analysis of feature learning in weight-tied autoencoders via the mean
field lens
- URL: http://arxiv.org/abs/2102.08373v1
- Date: Tue, 16 Feb 2021 18:58:37 GMT
- Title: Analysis of feature learning in weight-tied autoencoders via the mean
field lens
- Authors: Phan-Minh Nguyen
- Abstract summary: We analyze a class of two-layer weight-tied nonlinear autoencoders in the mean field framework.
Models trained with gradient descent are shown to admit a mean field limiting dynamics.
Experiments on real-life data demonstrate an interesting match with the theory.
- Score: 3.553493344868413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoencoders are among the earliest introduced nonlinear models for
unsupervised learning. Although they are widely adopted beyond research, it has
been a longstanding open problem to understand mathematically the feature
extraction mechanism that trained nonlinear autoencoders provide.
In this work, we make progress in this problem by analyzing a class of
two-layer weight-tied nonlinear autoencoders in the mean field framework. Upon
a suitable scaling, in the regime of a large number of neurons, the models
trained with stochastic gradient descent are shown to admit a mean field
limiting dynamics. This limiting description reveals an asymptotically precise
picture of feature learning by these models: their training dynamics exhibit
different phases that correspond to the learning of different principal
subspaces of the data, with varying degrees of nonlinear shrinkage dependent on
the $\ell_{2}$-regularization and stopping time. While we prove these results
under an idealized assumption of (correlated) Gaussian data, experiments on
real-life data demonstrate an interesting match with the theory.
The autoencoder setup of interest poses a nontrivial mathematical challenge
to proving these results. In this setup, the "Lipschitz" constants of the
models grow with the data dimension $d$. Consequently an adaptation of previous
analyses requires a number of neurons $N$ that is at least exponential in $d$.
Our main technical contribution is a new argument which proves that the
required $N$ is only polynomial in $d$. We conjecture that $N\gg d$ is
sufficient and that $N$ is necessarily larger than a data-dependent intrinsic
dimension, a behavior that is fundamentally different from previously studied
setups.
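The model class described in the abstract can be sketched in a few lines. Below is a minimal illustrative NumPy implementation of a two-layer weight-tied autoencoder under the mean field $1/N$ scaling, trained by one-sample SGD on correlated Gaussian data. This is not the paper's code: the tanh activation, the covariance spectrum, and all hyperparameters are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N = 16, 300            # data dimension, number of neurons
lam, lr, steps = 1e-4, 1.0, 4000

# Correlated Gaussian data: a few dominant principal directions,
# mimicking the paper's idealized data assumption.
spectrum = np.array([4.0, 2.0, 1.0] + [0.1] * (d - 3))

W = rng.normal(size=(N, d))   # tied encoder/decoder weights


def reconstruct(W, x):
    """Weight-tied reconstruction with 1/N mean field scaling:
    x_hat = (1/N) * W^T tanh(W x)."""
    return W.T @ np.tanh(W @ x) / N


def avg_loss(W, n_samples=200):
    """Average squared reconstruction error over fresh Gaussian samples."""
    xs = rng.normal(size=(n_samples, d)) * np.sqrt(spectrum)
    return np.mean([np.sum((reconstruct(W, x) - x) ** 2) for x in xs])


loss_init = avg_loss(W)

for _ in range(steps):
    x = rng.normal(size=d) * np.sqrt(spectrum)   # one Gaussian sample
    s = np.tanh(W @ x)
    err = W.T @ s / N - x                        # reconstruction error
    # Gradient of 0.5*||x_hat - x||^2 + lam*||W||_F^2, chain-ruled through
    # both appearances of the tied weight matrix W.
    grad = (np.outer(s, err)
            + np.outer((1 - s ** 2) * (W @ err), x)) / N + 2 * lam * W
    W -= lr * grad

loss_final = avg_loss(W)    # drops well below loss_init as the top
                            # principal subspace is learned
```

In this toy run, reconstruction error falls as the weights align with the leading principal directions of `spectrum`; varying `lam` and the stopping time changes how strongly the learned components are shrunk, loosely echoing the phases described above.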
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Random features and polynomial rules [0.0]
We present a generalization of the performance of random features models for generic supervised learning problems with Gaussian data.
We find good agreement far from the limits where $D \to \infty$ and at least one of $P/D^K$, $N/D^L$ remains finite.
arXiv Detail & Related papers (2024-02-15T18:09:41Z) - An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws [24.356906682593532]
We study the compute-optimal trade-off between model and training data set sizes for large neural networks.
Our result suggests a linear relation similar to that supported by the empirical analysis of chinchilla.
arXiv Detail & Related papers (2022-12-02T18:46:41Z) - FeDXL: Provable Federated Learning for Deep X-Risk Optimization [105.17383135458897]
We tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing algorithms are applicable.
The challenges for designing an FL algorithm for X-risks lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines.
arXiv Detail & Related papers (2022-10-26T00:23:36Z) - Easy Differentially Private Linear Regression [16.325734286930764]
We study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models.
We find that this algorithm obtains strong empirical performance in the data-rich setting.
arXiv Detail & Related papers (2022-08-15T17:42:27Z) - Hidden Progress in Deep Learning: SGD Learns Parities Near the
Computational Limit [36.17720004582283]
This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
arXiv Detail & Related papers (2022-07-18T17:55:05Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way, that it makes transformation outcome predictable by auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z) - A Neural Scaling Law from the Dimension of the Data Manifold [8.656787568717252]
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$.
The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$.
This simple theory predicts scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses.
arXiv Detail & Related papers (2020-04-22T19:16:06Z)
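The scaling-law prediction in the last entry admits a quick back-of-envelope check: under $L \propto N^{-\alpha}$ with $\alpha \approx 4/d$, the loss ratio between two model sizes is $(N_2/N_1)^{-\alpha}$. The short sketch below (not from the paper; the function names are made up for illustration) works out what doubling the parameter count buys at low versus high intrinsic dimension.

```python
def predicted_alpha(d):
    """Predicted scaling exponent alpha ~ 4/d for intrinsic data dimension d."""
    return 4.0 / d


def loss_ratio(n1, n2, d):
    """L(n2)/L(n1) under the power law L ∝ N^(-alpha)."""
    return (n2 / n1) ** (-predicted_alpha(d))


# Low intrinsic dimension -> fast scaling: alpha = 1, doubling N halves the loss.
r_low = loss_ratio(1e6, 2e6, d=4)     # 2^(-1.0) = 0.5
# High intrinsic dimension -> slow scaling: alpha = 0.1, only ~7% improvement.
r_high = loss_ratio(1e6, 2e6, d=40)   # 2^(-0.1) ≈ 0.933
```

The contrast makes the theory's point concrete: the same doubling of parameters is dramatically more effective when the data manifold is low-dimensional.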
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.