Understanding the Role of Nonlinearity in Training Dynamics of
Contrastive Learning
- URL: http://arxiv.org/abs/2206.01342v1
- Date: Thu, 2 Jun 2022 23:52:35 GMT
- Title: Understanding the Role of Nonlinearity in Training Dynamics of
Contrastive Learning
- Authors: Yuandong Tian
- Abstract summary: We study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one and two-layer nonlinear networks.
We show that the presence of nonlinearity leads to many local optima even in 1-layer setting.
For 2-layer setting, we also discover *global modulation*: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn.
- Score: 37.27098255569438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the empirical success of self-supervised learning (SSL) heavily relies
on the usage of deep nonlinear models, many theoretical works proposed to
understand SSL still focus on linear ones. In this paper, we study the role of
nonlinearity in the training dynamics of contrastive learning (CL) on one and
two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We
theoretically demonstrate that (1) the presence of nonlinearity leads to many
local optima even in 1-layer setting, each corresponding to certain patterns
from the data distribution, while with linear activation, only one major
pattern can be learned; and (2) nonlinearity leads to specialized weights into
diverse patterns, a behavior that linear activation is proven not capable of.
These findings suggest that models with lots of parameters can be regarded as a
\emph{brute-force} way to find these local optima induced by nonlinearity, a
possible underlying reason why empirical observations such as the lottery
ticket hypothesis hold. In addition, for 2-layer setting, we also discover
\emph{global modulation}: those local patterns discriminative from the
perspective of global-level patterns are prioritized to learn, further
characterizing the learning process. Simulation verifies our theoretical
findings.
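The homogeneity condition $h(x) = h'(x)x$ quoted in the abstract is satisfied by ReLU and leaky ReLU (taking the subgradient convention $h'(0) = 0$ for ReLU). A minimal numerical check of this identity, not taken from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention: h'(0) = 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.1):
    return np.where(x > 0, 1.0, alpha)

# Homogeneity h(x) = h'(x) * x holds pointwise for both activations,
# including at x = 0
x = np.linspace(-3, 3, 101)
assert np.allclose(relu(x), relu_grad(x) * x)
assert np.allclose(leaky_relu(x), leaky_relu_grad(x) * x)
```

This is the property the paper's dynamics analysis relies on: for homogeneous activations, the layer output can be rewritten as a data-dependent linear map, which makes the nonlinear training dynamics tractable.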
Related papers
- Understanding Representation Learnability of Nonlinear Self-Supervised
Learning [13.965135660149212]
Self-supervised learning (SSL) has empirically shown its data representation learnability in many downstream tasks.
Our paper is the first to analyze the learning results of the nonlinear SSL model accurately.
arXiv Detail & Related papers (2024-01-06T13:23:26Z)
- Learning Linearized Models from Nonlinear Systems with Finite Data [1.6026317505839445]
We consider the problem of identifying a linearized model when the true underlying dynamics is nonlinear.
We provide a multiple trajectories-based deterministic data acquisition algorithm followed by a regularized least squares algorithm.
Our error bound demonstrates a trade-off between the error due to nonlinearity and the error due to noise, and shows that one can learn the linearized dynamics with arbitrarily small error.
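As an illustrative sketch only (not the paper's multiple-trajectories data-acquisition algorithm), the idea of fitting a linearized model to data from a nonlinear system via regularized least squares can be shown on a hypothetical 2-D system with a small cubic nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)

A_true = np.array([[0.9, 0.2], [-0.1, 0.8]])

def f(x):
    # Hypothetical nonlinear dynamics: linear part plus a small cubic term
    return A_true @ x + 0.05 * x**3

# Collect short trajectories started near the origin, where the
# linearization x_{t+1} ~ A_true x_t is accurate
X, Y = [], []
for _ in range(50):
    x = 0.1 * rng.standard_normal(2)
    for _ in range(10):
        x_next = f(x)
        X.append(x)
        Y.append(x_next)
        x = x_next
X, Y = np.array(X), np.array(Y)

# Regularized least squares (ridge):
#   A_hat = argmin_A ||X A^T - Y||^2 + lam * ||A||^2
lam = 1e-6
A_hat = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y).T

print(np.round(A_hat, 3))  # close to the true linear part A_true
```

The residual gap between `A_hat` and `A_true` comes from the cubic term, mirroring the nonlinearity-versus-noise trade-off the error bound describes: sampling closer to the origin shrinks the nonlinearity error.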
arXiv Detail & Related papers (2023-09-15T22:58:03Z)
- Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks [8.180184504355571]
We conduct a theoretical characterization of the training process of a two-layer ReLU network trained by Gradient Flow on linearly separable data.
We reveal four different phases from the whole training process showing a general simplifying-to-complicating learning trend.
Specific nonlinear behaviors can also be precisely identified and captured theoretically, such as initial condensation, saddle-plateau dynamics, plateau escape, and changes of activation patterns with increasing complexity.
arXiv Detail & Related papers (2023-05-21T14:08:34Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs)
Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
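A toy illustration of the "project into a linearly separable feature space" claim, using a hand-chosen ReLU feature (hypothetical, not from the paper): the XOR labels are not linearly separable in the raw 2-D inputs, but become exactly linear once a single ReLU feature is appended.

```python
import numpy as np

# XOR: not linearly separable in the raw 2-D inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def relu(z):
    return np.maximum(z, 0.0)

# Append one hand-chosen ReLU feature: relu(x1 + x2 - 1)
phi = np.column_stack([X, relu(X.sum(axis=1) - 1.0)])

# In the lifted space the labels are exactly linear:
#   y = x1 + x2 - 2 * relu(x1 + x2 - 1)
w = np.array([1.0, 1.0, -2.0])
assert np.allclose(phi @ w, y)
```

One nonlinear feature suffices here; the question the paper raises is whether every feature in a typical network actually needs such a nonlinear transformation.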
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
- Linearization and Identification of Multiple-Attractors Dynamical System through Laplacian Eigenmaps [8.161497377142584]
We propose a Graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data-points belonging to the same dynamics.
We prove that there always exists a set of 2-dimensional embedding spaces in which the sub-dynamics are linear, and an n-dimensional embedding space in which they are quasi-linear.
We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time.
arXiv Detail & Related papers (2022-02-18T12:43:25Z)
- Hessian Eigenspectra of More Realistic Nonlinear Models [73.31363313577941]
We make a *precise* characterization of the Hessian eigenspectra for a broad family of nonlinear models.
Our analysis takes a step forward to identify the origin of many striking features observed in more complex machine learning models.
arXiv Detail & Related papers (2021-03-02T06:59:52Z)
- Nonlinear Invariant Risk Minimization: A Causal Approach [5.63479133344366]
We propose a learning paradigm that enables out-of-distribution generalization in the nonlinear setting.
We show identifiability of the data representation up to very simple transformations.
Extensive experiments on both synthetic and real-world datasets show that our approach significantly outperforms a variety of baseline methods.
arXiv Detail & Related papers (2021-02-24T15:38:41Z)
- Understanding self-supervised Learning Dynamics without Contrastive Pairs [72.1743263777693]
Contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point.
Non-contrastive methods such as BYOL and SimSiam show remarkable performance without negative pairs.
We study the nonlinear learning dynamics of non-contrastive SSL in simple linear networks.
arXiv Detail & Related papers (2021-02-12T22:57:28Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.