Three Mechanisms of Feature Learning in the Exact Solution of a Latent Variable Model
- URL: http://arxiv.org/abs/2401.07085v2
- Date: Sat, 4 May 2024 12:43:04 GMT
- Title: Three Mechanisms of Feature Learning in the Exact Solution of a Latent Variable Model
- Authors: Yizhou Xu, Liu Ziyin
- Abstract summary: We identify and exactly solve the learning dynamics of a one-hidden-layer linear model at any finite width.
Our solution identifies three novel prototype mechanisms of feature learning: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling.
In sharp contrast, none of these mechanisms is present in the kernel regime of the model.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We identify and exactly solve the learning dynamics of a one-hidden-layer linear model at any finite width whose limits exhibit both the kernel phase and the feature learning phase. We analyze the phase diagram of this model in different limits of common hyperparameters including width, layer-wise learning rates, scale of output, and scale of initialization. Our solution identifies three novel prototype mechanisms of feature learning: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling. In sharp contrast, none of these mechanisms is present in the kernel regime of the model. We empirically demonstrate that these discoveries also appear in deep nonlinear networks in real tasks.
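To make the setting concrete, here is a minimal numpy sketch in the spirit of the abstract (the task, width, and learning rate are our own illustrative choices, not the paper's exact model or parameterization): a one-hidden-layer linear network f(x) = a^T W x / gamma trained by gradient descent, tracking how the effective weight vector aligns with the target direction and how far the first-layer features move from their initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative rank-one regression task: y = beta . x with a unit vector beta.
d, n = 32, 512
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)
X = rng.normal(size=(n, d))
y = X @ beta

# One-hidden-layer linear model f(x) = a^T W x / gamma; gamma sets the
# output scale, one of the hyperparameters in the paper's phase diagram.
width, gamma, lr = 64, 1.0, 0.05
W = rng.normal(size=(width, d)) / np.sqrt(d)
a = rng.normal(size=width) / np.sqrt(width)
W0 = W.copy()

for t in range(2001):
    err = (X @ W.T @ a) / gamma - y               # residuals, shape (n,)
    grad_a = W @ X.T @ err / (n * gamma)          # d(loss)/da
    grad_W = np.outer(a, err @ X) / (n * gamma)   # d(loss)/dW
    a -= lr * grad_a
    W -= lr * grad_W
    if t % 500 == 0:
        v = W.T @ a                               # effective weight vector
        cos = v @ beta / np.linalg.norm(v)        # alignment with the target
        moved = np.linalg.norm(W - W0)            # how far features moved
        print(f"step {t:4d}  loss={np.mean(err**2):.4f}  "
              f"align={cos:+.3f}  feature_movement={moved:.3f}")
```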
Related papers
- Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond
In pursuit of a deeper understanding of deep learning's surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
- Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective
The Linear Complexity Sequence Model (LCSM) unites various sequence modeling techniques with linear complexity.
We segment the modeling processes of these models into three distinct stages: Expand, Oscillation, and Shrink.
We perform experiments to analyze the impact of different stage settings on language modeling and retrieval tasks.
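As a rough reading of those three stages (an interpretive sketch of ours: the stage names are mapped onto a generic linear-attention-style recurrence, not necessarily the paper's exact formulation):

```python
import numpy as np

def lcsm_step(s, q, k, v, decay=0.9):
    """One step of a generic linear-complexity recurrence (illustrative).

    Expand:      outer(k, v) lifts the incoming token into a 2-D memory state.
    Oscillation: an elementwise decay mixes old memory with the new update.
    Shrink:      the query q reads the memory back down to an output vector.
    """
    s = decay * s + np.outer(k, v)   # expand + oscillation
    o = q @ s                        # shrink
    return s, o

rng = np.random.default_rng(0)
d = 4
s = np.zeros((d, d))                 # memory state
for _ in range(8):                   # process a short sequence
    q, k, v = rng.normal(size=(3, d))
    s, o = lcsm_step(s, q, k, v)
```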
arXiv Detail & Related papers (2024-05-27T17:38:55Z)
- Understanding the Double Descent Phenomenon in Deep Learning
This tutorial sets out the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the multiple interpolating solutions, a smoothly interpolating one.
Section 3 explores double descent with two linear models and gives further points of view from recent related works.
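For intuition, a toy random-features experiment of the kind such tutorials use (our own illustration, not code from the paper) typically shows the test error peaking near the interpolation threshold p ≈ n_train and descending again as p grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear teacher, fit with p random ReLU features.
d, n_train, n_test = 10, 40, 500
w_star = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_star

for p in [5, 10, 20, 40, 80, 160, 320]:
    V = rng.normal(size=(d, p)) / np.sqrt(d)    # random feature directions
    F_tr = np.maximum(X_tr @ V, 0.0)
    F_te = np.maximum(X_te @ V, 0.0)
    w = np.linalg.pinv(F_tr) @ y_tr             # min-norm least squares
    print(f"p={p:4d}  test MSE={np.mean((F_te @ w - y_te) ** 2):.3f}")
```

With min-norm least squares (`pinv`), the fit interpolates the noisy training set once p reaches n_train, which is what drives the peak.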
arXiv Detail & Related papers (2024-03-15T16:51:24Z)
- Exploiting the capacity of deep networks only at training stage for nonlinear black-box system identification
This study presents a novel training strategy that uses deep models only at the training stage.
The proposed objective function sums the individual objectives of the student and teacher models with a distance penalty between their learned latent representations.
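In code, the described objective might look like the following minimal sketch (the names and the squared-distance form of the penalty, weighted by an illustrative lam, are our assumptions):

```python
import numpy as np

def combined_loss(y, y_student, y_teacher, z_student, z_teacher, lam=0.1):
    """Sum of each model's task objective plus a latent-distance penalty.

    `lam` (the penalty weight) is an illustrative hyperparameter.
    """
    task_student = np.mean((y_student - y) ** 2)
    task_teacher = np.mean((y_teacher - y) ** 2)
    latent_gap = np.mean((z_student - z_teacher) ** 2)
    return task_student + task_teacher + lam * latent_gap
```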
arXiv Detail & Related papers (2023-12-26T09:32:42Z)
- On the Stepwise Nature of Self-Supervised Learning
We present a simple picture of the training process of joint embedding self-supervised learning methods.
We find that these methods learn their high-dimensional embeddings one dimension at a time in a sequence of discrete, well-separated steps.
Our theory suggests that, just as kernel regression can be thought of as a model of supervised learning, kernel PCA may serve as a useful model of self-supervised learning.
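A small sketch of that analogy (ours; it assumes a plain linear kernel for simplicity): kernel PCA builds the embedding one eigenvector at a time, and well-separated kernel eigenvalues mirror the discrete, well-separated learning steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data whose directions have well-separated variances -> well-separated steps.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

K = X @ X.T                                   # linear kernel, for simplicity
K = K - K.mean(axis=0, keepdims=True)         # double-center the kernel
K = K - K.mean(axis=1, keepdims=True)

eigvals, eigvecs = np.linalg.eigh(K)          # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Each embedding dimension is one eigenvector, "learned" in eigenvalue order.
embedding = eigvecs[:, :3] * np.sqrt(np.maximum(eigvals[:3], 0.0))
print(np.round(eigvals[:5]))                  # large gaps between eigenvalues
```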
arXiv Detail & Related papers (2023-03-27T17:59:20Z)
- A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer Neural Networks
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed.
Our theory accommodates different scaling choices of the model, resulting in two regimes of the mean-field (MF) limit that demonstrate distinctive behaviors.
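Schematically, such a partially-trained three-layer model looks as follows (dimensions, nonlinearity, and scalings here are illustrative assumptions, not the paper's precise mean-field setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m1, m2 = 16, 128, 128                       # illustrative dimensions

W1 = rng.normal(size=(m1, d)) / np.sqrt(d)     # first layer: random, frozen
W2 = rng.normal(size=(m2, m1)) / np.sqrt(m1)   # second layer: trainable
a = rng.normal(size=m2) / m2                   # output layer: 1/m MF scaling

def forward(x):
    h1 = np.tanh(W1 @ x)   # fixed random features (never updated)
    h2 = np.tanh(W2 @ h1)  # the partially-trained part of the network
    return a @ h2
```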
arXiv Detail & Related papers (2022-10-28T17:26:27Z)
- Dynamic Latent Separation for Deep Learning
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning
We study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one- and two-layer nonlinear networks.
We show that the presence of nonlinearity leads to many local optima even in the one-layer setting.
In the two-layer setting, we also discover global modulation: local patterns that are discriminative from the perspective of global-level patterns are prioritized during learning.
arXiv Detail & Related papers (2022-06-02T23:52:35Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- Phase diagram for two-layer ReLU neural networks at infinite-width limit
We draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit.
We identify three regimes in the phase diagram: the linear regime, the critical regime, and the condensed regime.
In the linear regime, the NN training dynamics are approximately linear, resembling a random feature model with exponential loss decay.
In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations.
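A quick numerical illustration of the linear-regime claim (our simplification: the first layer is frozen outright, making the model exactly a random feature model, whose loss decays exponentially mode by mode):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 1024, 64
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])

W = rng.normal(size=(m, d)) / np.sqrt(d)       # first layer, frozen here
phi = np.maximum(X @ W.T, 0.0) / np.sqrt(m)    # fixed ReLU random features
a, lr = np.zeros(m), 1.0

for t in range(201):
    err = phi @ a - y
    a -= lr * phi.T @ err / n                  # gradient step on the outputs
    if t % 50 == 0:
        print(f"step {t:3d}  loss={np.mean(err**2):.3e}")
```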
arXiv Detail & Related papers (2020-07-15T06:04:35Z)
- Kernel and Rich Regimes in Overparametrized Models
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
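The kernel-to-rich transition can be reproduced in a classic toy model (a standard diagonal linear network, chosen by us for illustration rather than taken from the paper): the initialization scale alpha interpolates between a ridge-like kernel bias at large alpha and a sparsity-seeking rich bias at small alpha.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 10
X = rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[:2] = 1.0         # 2-sparse teacher
y = X @ w_star

for alpha in [1e-3, 1.0]:                      # initialization scale
    u = np.full(d, alpha)                      # w = u*u - v*v starts at 0
    v = np.full(d, alpha)
    lr = 0.01
    for _ in range(50_000):
        g = X.T @ (X @ (u * u - v * v) - y) / n   # gradient w.r.t. w
        u, v = u - lr * 2 * u * g, v + lr * 2 * v * g
    w = u * u - v * v
    print(f"alpha={alpha:g}: coordinates with |w_i| > 0.1:",
          int((np.abs(w) > 0.1).sum()))
```

Here the small-alpha run typically recovers the 2-sparse teacher, while the large-alpha run spreads weight across many coordinates.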
arXiv Detail & Related papers (2020-02-20T15:43:02Z)