$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
- URL: http://arxiv.org/abs/2407.18134v2
- Date: Wed, 11 Sep 2024 21:50:21 GMT
- Title: $\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
- Authors: Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCun,
- Abstract summary: We develop an objective that encodes how a sample relates to others.
We train vision models based on similarities in class or text caption descriptions.
Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8%$ on ImageNet and $18.1%$ on ImageNet Real.
- Score: 62.565573316667276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss - an objective matching related samples - underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is the related positive sample. Crucially, similarities \textit{across} samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called $\mathbb{X}$-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP by $0.6\%$ on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains of $3.3$-$5.6$\% over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.
Related papers
- LeOCLR: Leveraging Original Images for Contrastive Learning of Visual Representations [4.680881326162484]
Contrastive instance discrimination methods outperform supervised learning in downstream tasks such as image classification and object detection.
A common augmentation technique in contrastive learning is random cropping followed by resizing.
We introduce LeOCLR, a framework that employs a novel instance discrimination approach and an adapted loss function.
arXiv Detail & Related papers (2024-03-11T15:33:32Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Semantic Positive Pairs for Enhancing Visual Representation Learning of Instance Discrimination methods [4.680881326162484]
Self-supervised learning algorithms (SSL) based on instance discrimination have shown promising results.
We propose an approach to identify those images with similar semantic content and treat them as positive instances.
We run experiments on three benchmark datasets: ImageNet, STL-10 and CIFAR-10 with different instance discrimination SSL approaches.
arXiv Detail & Related papers (2023-06-28T11:47:08Z) - MSVQ: Self-Supervised Learning with Multiple Sample Views and Queues [10.327408694770709]
We propose a new simple framework, namely Multiple Sample Views and Queues (MSVQ)
We jointly construct three soft labels on-the-fly by utilizing two complementary and symmetric approaches.
Let the student network mimic the similarity relationships between the samples, thus giving the student network a more flexible ability to identify false negative samples in the dataset.
arXiv Detail & Related papers (2023-05-09T12:05:14Z) - Weakly Supervised Contrastive Learning [68.47096022526927]
We introduce a weakly supervised contrastive learning framework (WCL) to tackle this issue.
WCL achieves 65% and 72% ImageNet Top-1 Accuracy using ResNet50, which is even higher than SimCLRv2 with ResNet101.
arXiv Detail & Related papers (2021-10-10T12:03:52Z) - Bag of Instances Aggregation Boosts Self-supervised Learning [122.61914701794296]
We propose a simple but effective distillation strategy for unsupervised learning.
Our method, termed as BINGO, targets at transferring the relationship learned by the teacher to the student.
BINGO achieves new state-of-the-art performance on small scale models.
arXiv Detail & Related papers (2021-07-04T17:33:59Z) - Doubly Contrastive Deep Clustering [135.7001508427597]
We present a novel Doubly Contrastive Deep Clustering (DCDC) framework, which constructs contrastive loss over both sample and class views.
Specifically, for the sample view, we set the class distribution of the original sample and its augmented version as positive sample pairs.
For the class view, we build the positive and negative pairs from the sample distribution of the class.
In this way, two contrastive losses successfully constrain the clustering results of mini-batch samples in both sample and class level.
arXiv Detail & Related papers (2021-03-09T15:15:32Z) - Un-Mix: Rethinking Image Mixtures for Unsupervised Visual Representation
Learning [108.999497144296]
Recently advanced unsupervised learning approaches use the siamese-like framework to compare two "views" from the same image for learning representations.
This work aims to involve the distance concept on label space in the unsupervised learning and let the model be aware of the soft degree of similarity between positive or negative pairs.
Despite its conceptual simplicity, we show empirically that with the solution -- Unsupervised image mixtures (Un-Mix), we can learn subtler, more robust and generalized representations from the transformed input and corresponding new label space.
arXiv Detail & Related papers (2020-03-11T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.