Understanding the Role of Nonlinearity in Training Dynamics of
Contrastive Learning
- URL: http://arxiv.org/abs/2206.01342v1
- Date: Thu, 2 Jun 2022 23:52:35 GMT
- Title: Understanding the Role of Nonlinearity in Training Dynamics of
Contrastive Learning
- Authors: Yuandong Tian
- Abstract summary: We study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one and two-layer nonlinear networks.
We show that the presence of nonlinearity leads to many local optima even in 1-layer setting.
For 2-layer setting, we also discover *global modulation*: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn.
- Score: 37.27098255569438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the empirical success of self-supervised learning (SSL) heavily relies
on the usage of deep nonlinear models, many theoretical works proposed to
understand SSL still focus on linear ones. In this paper, we study the role of
nonlinearity in the training dynamics of contrastive learning (CL) on one and
two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We
theoretically demonstrate that (1) the presence of nonlinearity leads to many
local optima even in 1-layer setting, each corresponding to certain patterns
from the data distribution, while with linear activation, only one major
pattern can be learned; and (2) nonlinearity leads to specialized weights into
diverse patterns, a behavior that linear activation is proven not capable of.
These findings suggest that models with lots of parameters can be regarded as a
\emph{brute-force} way to find these local optima induced by nonlinearity, a
possible underlying reason why empirical observations such as the lottery
ticket hypothesis hold. In addition, for 2-layer setting, we also discover
\emph{global modulation}: those local patterns discriminative from the
perspective of global-level patterns are prioritized to learn, further
characterizing the learning process. Simulation verifies our theoretical
findings.
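The homogeneity condition $h(x) = h'(x)x$ quoted in the abstract is satisfied by ReLU and leaky ReLU (taking the subgradient convention $h'(0) = 0$ for ReLU). A minimal numerical check of this identity, not taken from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention: h'(0) = 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.1):
    return np.where(x > 0, 1.0, alpha)

# Homogeneity h(x) = h'(x) * x holds pointwise for both activations,
# including at x = 0
x = np.linspace(-3, 3, 101)
assert np.allclose(relu(x), relu_grad(x) * x)
assert np.allclose(leaky_relu(x), leaky_relu_grad(x) * x)
```

This is the property the paper's dynamics analysis relies on: for homogeneous activations, the layer output can be rewritten as a data-dependent linear map, which makes the nonlinear training dynamics tractable.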
Related papers
- Understanding Representation Learnability of Nonlinear Self-Supervised
Learning [13.965135660149212]
Self-supervised learning (SSL) has empirically shown its data representation learnability in many downstream tasks.
Our paper is the first to analyze the learning results of the nonlinear SSL model accurately.
arXiv Detail & Related papers (2024-01-06T13:23:26Z)
- Learning Linearized Models from Nonlinear Systems with Finite Data [1.6026317505839445]
We consider the problem of identifying a linearized model when the true underlying dynamics is nonlinear.
We provide a multiple trajectories-based deterministic data acquisition algorithm followed by a regularized least squares algorithm.
Our error bound demonstrates a trade-off between the error due to nonlinearity and the error due to noise, and shows that one can learn the linearized dynamics with arbitrarily small error.
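As an illustrative sketch only (not the paper's multiple-trajectories data-acquisition algorithm), the idea of fitting a linearized model to data from a nonlinear system via regularized least squares can be shown on a hypothetical 2-D system with a small cubic nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)

A_true = np.array([[0.9, 0.2], [-0.1, 0.8]])

def f(x):
    # Hypothetical nonlinear dynamics: linear part plus a small cubic term
    return A_true @ x + 0.05 * x**3

# Collect short trajectories started near the origin, where the
# linearization x_{t+1} ~ A_true x_t is accurate
X, Y = [], []
for _ in range(50):
    x = 0.1 * rng.standard_normal(2)
    for _ in range(10):
        x_next = f(x)
        X.append(x)
        Y.append(x_next)
        x = x_next
X, Y = np.array(X), np.array(Y)

# Regularized least squares (ridge):
#   A_hat = argmin_A ||X A^T - Y||^2 + lam * ||A||^2
lam = 1e-6
A_hat = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y).T

print(np.round(A_hat, 3))  # close to the true linear part A_true
```

The residual gap between `A_hat` and `A_true` comes from the cubic term, mirroring the nonlinearity-versus-noise trade-off the error bound describes: sampling closer to the origin shrinks the nonlinearity error.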
arXiv Detail & Related papers (2023-09-15T22:58:03Z)
- Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks [8.180184504355571]
We conduct a theoretical characterization of the training process of a two-layer ReLU network trained by Gradient Flow on linearly separable data.
We reveal four different phases from the whole training process showing a general simplifying-to-complicating learning trend.
Specific nonlinear behaviors can also be precisely identified and captured theoretically, such as initial condensation, saddle-plateau dynamics, plateau escape, and changes of activation patterns with increasing complexity.
arXiv Detail & Related papers (2023-05-21T14:08:34Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs)
Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
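A toy illustration of the "project into a linearly separable feature space" claim, using a hand-chosen ReLU feature (hypothetical, not from the paper): the XOR labels are not linearly separable in the raw 2-D inputs, but become exactly linear once a single ReLU feature is appended.

```python
import numpy as np

# XOR: not linearly separable in the raw 2-D inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def relu(z):
    return np.maximum(z, 0.0)

# Append one hand-chosen ReLU feature: relu(x1 + x2 - 1)
phi = np.column_stack([X, relu(X.sum(axis=1) - 1.0)])

# In the lifted space the labels are exactly linear:
#   y = x1 + x2 - 2 * relu(x1 + x2 - 1)
w = np.array([1.0, 1.0, -2.0])
assert np.allclose(phi @ w, y)
```

One nonlinear feature suffices here; the question the paper raises is whether every feature in a typical network actually needs such a nonlinear transformation.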
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
- Linearization and Identification of Multiple-Attractors Dynamical System through Laplacian Eigenmaps [8.161497377142584]
We propose a Graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data-points belonging to the same dynamics.
We prove that there always exists a set of 2-dimensional embedding spaces in which the sub-dynamics are linear, and an n-dimensional embedding space in which they are quasi-linear.
We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time.
arXiv Detail & Related papers (2022-02-18T12:43:25Z)
- Hessian Eigenspectra of More Realistic Nonlinear Models [73.31363313577941]
We make a *precise* characterization of the Hessian eigenspectra for a broad family of nonlinear models.
Our analysis takes a step forward to identify the origin of many striking features observed in more complex machine learning models.
arXiv Detail & Related papers (2021-03-02T06:59:52Z)
- Nonlinear Invariant Risk Minimization: A Causal Approach [5.63479133344366]
We propose a learning paradigm that enables out-of-distribution generalization in the nonlinear setting.
We show identifiability of the data representation up to very simple transformations.
Extensive experiments on both synthetic and real-world datasets show that our approach significantly outperforms a variety of baseline methods.
arXiv Detail & Related papers (2021-02-24T15:38:41Z)
- Understanding self-supervised Learning Dynamics without Contrastive Pairs [72.1743263777693]
Contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point.
Non-contrastive methods such as BYOL and SimSiam show remarkable performance without negative pairs.
We study the nonlinear learning dynamics of non-contrastive SSL in simple linear networks.
arXiv Detail & Related papers (2021-02-12T22:57:28Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.