Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive
Learners With FlatNCE
- URL: http://arxiv.org/abs/2107.01152v1
- Date: Fri, 2 Jul 2021 15:50:43 GMT
- Title: Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive
Learners With FlatNCE
- Authors: Junya Chen, Zhe Gan, Xuan Li, Qing Guo, Liqun Chen, Shuyang Gao,
Tagyoung Chung, Yi Xu, Belinda Zeng, Wenlian Lu, Fan Li, Lawrence Carin,
Chenyang Tao
- Abstract summary: We reveal mathematically why contrastive learners fail in the small-batch-size regime.
We present a novel non-trivial contrastive objective named FlatNCE, which fixes this issue.
- Score: 104.37515476361405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: InfoNCE-based contrastive representation learners, such as SimCLR, have been
tremendously successful in recent years. However, these contrastive schemes are
notoriously resource demanding, as their effectiveness breaks down with
small-batch training (i.e., the log-K curse, where K is the batch size). In
this work, we reveal mathematically why contrastive learners fail in the
small-batch-size regime, and present a simple yet non-trivial contrastive
objective named FlatNCE, which fixes this issue. Unlike InfoNCE, our FlatNCE no
longer explicitly appeals to a discriminative classification goal for
contrastive learning. Theoretically, we show FlatNCE is the mathematical dual
formulation of InfoNCE, thereby connecting it to the classical literature on
energy-based modeling; and empirically, we demonstrate that, with minimal code
changes, FlatNCE delivers an immediate performance boost independent of
subject-matter engineering effort. The significance of this work is furthered
by the powerful generalization of contrastive learning techniques and by the
introduction of new tools to monitor and diagnose contrastive training. We
substantiate our claims with empirical evidence on CIFAR10, ImageNet, and other
datasets, where FlatNCE consistently outperforms InfoNCE.
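To make the contrast with InfoNCE concrete, below is a minimal PyTorch sketch of both objectives. InfoNCE is the standard cross-entropy over one positive and K-1 negatives, whose mutual-information bound saturates at log K and therefore degrades for small batches; FlatNCE, as described in the abstract, can be sketched as an exponentiated score-gap objective that is self-normalized with a stop-gradient. The function names, the convention that column 0 of `logits` holds the positive score, and the exact normalization are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch (not the official FlatNCE code): InfoNCE vs. a FlatNCE-style loss.
# Assumes `logits` has shape (batch, K), with the positive pair's score in
# column 0 and K-1 negative scores in the remaining columns, already divided
# by the temperature.
import torch
import torch.nn.functional as F

def info_nce_loss(logits: torch.Tensor) -> torch.Tensor:
    # InfoNCE = cross-entropy that classifies the positive among K candidates;
    # its mutual-information bound is capped at log K, hence the "log-K curse".
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

def flat_nce_loss(logits: torch.Tensor) -> torch.Tensor:
    # FlatNCE-style objective (sketch): exponentiated gaps between negative and
    # positive scores, self-normalized with a stop-gradient so the loss value
    # stays constant while its gradient still pushes the positive score above
    # the negatives.
    pos = logits[:, :1]                          # (batch, 1) positive score
    neg = logits[:, 1:]                          # (batch, K-1) negative scores
    gap = torch.logsumexp(neg - pos, dim=1)      # log sum_k exp(s_k - s_pos)
    return torch.exp(gap - gap.detach()).mean()  # value == 1, gradient == d(gap)
```

In a SimCLR-style pipeline, `logits` would be the cosine similarities between augmented views scaled by a temperature, so swapping the loss function is the only change required, which is consistent with the paper's claim that FlatNCE needs minimal code modification.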
Related papers
- CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning [52.63674911541416]
Few-shot class-incremental learning (FSCIL) faces several challenges, such as overfitting and forgetting.
Our primary focus is representation learning on base classes to tackle the unique challenge of FSCIL.
We find that encouraging the spread of features within a more confined feature space enables the learned representation to strike a better balance between transferability and discriminability.
arXiv Detail & Related papers (2024-10-08T02:23:16Z)
- Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning [99.05401042153214]
In-context learning (ICL) is potentially attributed to two major abilities: task recognition (TR) and task learning (TL).
We take the first step by examining the pre-training dynamics of the emergence of ICL.
We propose a simple yet effective method to better integrate these two abilities for ICL at inference time.
arXiv Detail & Related papers (2024-06-20T06:37:47Z)
- BECLR: Batch Enhanced Contrastive Few-Shot Learning [1.450405446885067]
Unsupervised few-shot learning aspires to bridge this gap by discarding the reliance on annotations at training time.
We propose a novel Dynamic Clustered mEmory (DyCE) module to promote a highly separable latent representation space.
We then tackle the somewhat overlooked yet critical issue of sample bias at the few-shot inference stage.
arXiv Detail & Related papers (2024-02-04T10:52:43Z)
- Unbiased and Efficient Self-Supervised Incremental Contrastive Learning [31.763904668737304]
We propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of a novel Incremental InfoNCE (NCE-II) loss function.
ICL achieves up to 16.7x training speedup and 16.8x faster convergence with competitive results.
arXiv Detail & Related papers (2023-01-28T06:11:31Z)
- Mitigating Forgetting in Online Continual Learning via Contrasting Semantically Distinct Augmentations [22.289830907729705]
Online continual learning (OCL) aims to enable a model to learn from a non-stationary data stream, continuously acquiring new knowledge while retaining what has already been learnt.
The main challenge is the "catastrophic forgetting" issue: the inability to retain previously learnt knowledge while learning new tasks.
arXiv Detail & Related papers (2022-11-10T05:29:43Z)
- Integrating Prior Knowledge in Contrastive Learning with Kernel [4.050766659420731]
We use kernel theory to propose a novel loss, called decoupled uniformity, that i) allows the integration of prior knowledge and ii) removes the negative-positive coupling in the original InfoNCE loss.
In an unsupervised setting, we empirically demonstrate that CL benefits from generative models to improve its representation on both natural and medical images.
arXiv Detail & Related papers (2022-06-03T15:43:08Z)
- A Low Rank Promoting Prior for Unsupervised Contrastive Learning [108.91406719395417]
We construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning.
Our hypothesis explicitly requires that all the samples belonging to the same instance class lie on the same subspace with small dimension.
Empirical evidence shows that the proposed algorithm clearly surpasses the state-of-the-art approaches on multiple benchmarks.
arXiv Detail & Related papers (2021-08-05T15:58:25Z)
- Heterogeneous Contrastive Learning: Encoding Spatial Information for Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations.
We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire the future research of self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)