Related papers: Understanding Self-Supervised Learning via Gaussian Mixture Models

Understanding Self-Supervised Learning via Gaussian Mixture Models

URL: http://arxiv.org/abs/2411.03517v2
Date: Thu, 06 Feb 2025 18:48:27 GMT
Title: Understanding Self-Supervised Learning via Gaussian Mixture Models
Authors: Parikshit Bansal, Ali Kavis, Sujay Sanghavi,
Abstract summary: We analyze self-supervised learning in a natural context: dimensionality reduction in Gaussian Mixture Models.<n>We show that vanilla contrastive learning is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic.<n>In this setting we show that contrastive learning learns the subset of fisher-optimal subspace, effectively filtering out all the noise from the learnt representations.
Score: 19.51336063093898
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-supervised learning attempts to learn representations from un-labeled data; it does so via a loss function that encourages the embedding of a point to be close to that of its augmentations. This simple idea performs remarkably well, yet it is not precisely theoretically understood why this is the case. In this paper we analyze self-supervised learning in a natural context: dimensionality reduction in Gaussian Mixture Models. Crucially, we define an augmentation of a data point as being another independent draw from the same underlying mixture component. We show that vanilla contrastive learning (specifically, the InfoNCE loss) is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic -- something that vanilla spectral techniques cannot do. We also prove a similar result for "non-contrastive" self-supervised learning (i.e., SimSiam loss). We further extend our analyses to multi-modal contrastive learning algorithms (e.g., CLIP). In this setting we show that contrastive learning learns the subset of fisher-optimal subspace, effectively filtering out all the noise from the learnt representations. Finally, we corroborate our theoretical finding through synthetic data experiments.

Related papers

Robustness of Nonlinear Representation Learning [60.15898117103069]
We study the problem of unsupervised representation learning in slightly misspecified settings. We show that the mixing can be identified up to linear transformations and small errors. Those results are a step towards identifiability results for unsupervised representation learning for real-world data.
arXiv Detail & Related papers (2025-03-19T15:57:03Z)
An Information Theoretic Approach to Machine Unlearning [43.423418819707784]
To comply with AI and data regulations, the need to forget private or copyrighted information from trained machine learning models is increasingly important. In this work, we address the zero-shot unlearning scenario, whereby an unlearning algorithm must be able to remove data given only a trained model and the data to be forgotten. We derive a simple but principled zero-shot unlearning method based on the geometry of the model.
arXiv Detail & Related papers (2024-02-02T13:33:30Z)
Hodge-Aware Contrastive Learning [101.56637264703058]
Simplicial complexes prove effective in modeling data with multiway dependencies. We develop a contrastive self-supervised learning approach for processing simplicial data.
arXiv Detail & Related papers (2023-09-14T00:40:07Z)
Toward Understanding Generative Data Augmentation [16.204251285425478]
We show that generative data augmentation can enjoy a faster learning rate when the order of divergence term is $o(maxleft( log(m)beta_m, 1 / sqrtm)right)$. We prove that in both cases, though generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the train set is small.
arXiv Detail & Related papers (2023-05-27T13:46:08Z)
Contrastive Learning Is Spectral Clustering On Similarity Graph [12.47963220169677]
We show that contrastive learning with the standard InfoNCE loss is equivalent to spectral clustering on the similarity graph. Motivated by our theoretical insights, we introduce the Kernel-InfoNCE loss.
arXiv Detail & Related papers (2023-03-27T11:13:35Z)
On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data constrained settings. Instead of relying on big datasets to learn neural networks, we will replace some modules by known functions reflecting the structure of the data. Our approach falls under the hood of "inductive biases", which can be defined as hypothesis on the data at hand restricting the space of models to explore.
arXiv Detail & Related papers (2023-02-21T14:22:01Z)
InfoNCE Loss Provably Learns Cluster-Preserving Representations [54.28112623495274]
We show that the representation learned by InfoNCE with a finite number of negative samples is consistent with respect to clusters in the data. Our main result is to show that the representation learned by InfoNCE with a finite number of negative samples is also consistent with respect to clusters in the data.
arXiv Detail & Related papers (2023-02-15T19:45:35Z)
Learning sparse features can lead to overfitting in neural networks [9.2104922520782]
We show that feature learning can perform worse than lazy training. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth.
arXiv Detail & Related papers (2022-06-24T14:26:33Z)
Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap [64.60460828425502]
We propose a new guarantee on the downstream performance of contrastive learning. Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations. We propose an unsupervised model selection metric ARC that aligns well with downstream accuracy.
arXiv Detail & Related papers (2022-03-25T05:36:26Z)
On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods. We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where our task is not purely opaque. Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z)
Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss. We examine how these benign overfitting phenomena occur in a two-layer neural network setting. We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
Understanding self-supervised Learning Dynamics without Contrastive Pairs [72.1743263777693]
Contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point. BYOL and SimSiam, show remarkable performance it without negative pairs. We study the nonlinear learning dynamics of non-contrastive SSL in simple linear networks.
arXiv Detail & Related papers (2021-02-12T22:57:28Z)
Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning. We propose a novel method of using data augmentations when training autoencoders. We train a Variational Autoencoder in such a way, that it makes transformation outcome predictable by auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z)
Graph Embedding with Data Uncertainty [113.39838145450007]
spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines. Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.