Related papers: InfoNCE Induces Gaussian Distribution

InfoNCE Induces Gaussian Distribution

URL: http://arxiv.org/abs/2602.24012v1
Date: Fri, 27 Feb 2026 13:35:58 GMT
Title: InfoNCE Induces Gaussian Distribution
Authors: Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa,
Abstract summary: A loss in contrastive training is InfoNCE and its variants.<n>We show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training.<n>The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
Score: 7.8922077372145685
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.

Related papers

When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling [15.148577493784051]
Gaussian equivalence theory (GET) states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates.<n>But numerical experiments show that this equivalence can fail even for simple embeddings under general scaling regimes.<n>We introduce a Conditional Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model.
arXiv Detail & Related papers (2025-12-03T00:23:12Z)
Asymptotic Analysis of Two-Layer Neural Networks after One Gradient Step under Gaussian Mixtures Data with Structure [0.8287206589886879]
We study the training and generalization performance of two-layer neural networks (NNs) after one descent step under structured data.<n>We prove that a high-order model performs equivalent to the nonlinear neural networks under certain conditions.
arXiv Detail & Related papers (2025-03-02T11:28:54Z)
The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications [1.8416014644193066]
We compare learned neural scores to the scores of two kinds of analytically tractable distributions.<n>We claim that the learned neural score is dominated by its linear (Gaussian) approximation for moderate to high noise scales.<n>We show that this allows the skipping of the first 15-30% of sampling steps while maintaining high sample quality.
arXiv Detail & Related papers (2024-12-12T21:31:27Z)
Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models [59.331993845831946]
Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties. This paper provides the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models.
arXiv Detail & Related papers (2024-03-03T23:15:48Z)
Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction. We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue. In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
Learning Disentangled Discrete Representations [22.5004558029479]
We show the relationship between discrete latent spaces and disentangled representations by replacing the standard Gaussian variational autoencoder with a tailored categorical variational autoencoder. We provide both analytical and empirical findings that demonstrate the advantages of discrete VAEs for learning disentangled representations.
arXiv Detail & Related papers (2023-07-26T12:29:58Z)
Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression [53.15502562048627]
Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator. This work delves into a statistical analysis of augmentation-based pretraining.
arXiv Detail & Related papers (2023-06-01T15:18:55Z)
A Geometric Perspective on Diffusion Models [57.27857591493788]
We inspect the ODE-based sampling of a popular variance-exploding SDE. We establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm.
arXiv Detail & Related papers (2023-05-31T15:33:16Z)
On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods. We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
Optimal regularizations for data generation with probabilistic graphical models [0.0]
Empirically, well-chosen regularization schemes dramatically improve the quality of the inferred models. We consider the particular case of L 2 and L 1 regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models.
arXiv Detail & Related papers (2021-12-02T14:45:16Z)
Loss function based second-order Jensen inequality and its application to particle variational inference [112.58907653042317]
Particle variational inference (PVI) uses an ensemble of models as an empirical approximation for the posterior distribution. PVI iteratively updates each model with a repulsion force to ensure the diversity of the optimized models. We derive a novel generalization error bound and show that it can be reduced by enhancing the diversity of models.
arXiv Detail & Related papers (2021-06-09T12:13:51Z)
Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable semi-implicit extrapolational (SIVI) Our method maps SIVI's evidence to a rigorous inference of lower gradient values.
arXiv Detail & Related papers (2021-01-15T11:39:09Z)
Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers. We find that test errors tend to concentrate around a small typical value $varepsilon*$, which deviates substantially from the test error of worst-case interpolating model. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.