SLIP: Self-supervision meets Language-Image Pre-training
- URL: http://arxiv.org/abs/2112.12750v1
- Date: Thu, 23 Dec 2021 18:07:13 GMT
- Title: SLIP: Self-supervision meets Language-Image Pre-training
- Authors: Norman Mu, Alexander Kirillov, David Wagner, Saining Xie
- Abstract summary: We study whether self-supervised learning can aid in the use of language supervision for visual representation learning.
We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training.
We find that SLIP enjoys the best of both worlds: better performance than either self-supervision or language supervision alone.
- Score: 79.53764315471543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that self-supervised pre-training leads to improvements
over supervised learning on challenging visual recognition tasks. CLIP, an
exciting new approach to learning with language supervision, demonstrates
promising performance on a wide variety of benchmarks. In this work, we explore
whether self-supervised learning can aid in the use of language supervision for
visual representation learning. We introduce SLIP, a multi-task learning
framework for combining self-supervised learning and CLIP pre-training. After
pre-training with Vision Transformers, we thoroughly evaluate representation
quality and compare performance to both CLIP and self-supervised learning under
three distinct settings: zero-shot transfer, linear classification, and
end-to-end finetuning. Across ImageNet and a battery of additional datasets, we
find that SLIP improves accuracy by a large margin. We validate our results
further with experiments on different model sizes, training schedules, and
pre-training datasets. Our findings show that SLIP enjoys the best of both
worlds: better performance than self-supervision (+8.1% linear accuracy) and
language supervision (+5.2% zero-shot accuracy).
Related papers
- What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights [67.72413262980272]
Severe data imbalance naturally exists among web-scale vision-language datasets.
We find that CLIP pre-trained on such data exhibits notable robustness to the imbalance compared to supervised learning.
The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
arXiv Detail & Related papers (2024-05-31T17:57:24Z)
- Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations [6.990891188823598]
We present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features.
Our framework is specifically designed to work on web-scraped data by not relying on negative examples and addressing the one-to-one correspondence issue.
arXiv Detail & Related papers (2024-05-23T07:18:08Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study the transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module (see the zero-shot classification sketch after this list).
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Self-Supervision Can Be a Good Few-Shot Learner [42.06243069679068]
We propose an effective unsupervised few-shot learning method, learning representations with self-supervision.
Specifically, we maximize the mutual information (MI) of instances and their representations with a low-bias MI estimator.
We show that self-supervised pre-training can outperform supervised pre-training under the appropriate conditions.
arXiv Detail & Related papers (2022-07-19T10:23:40Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- When Does Contrastive Visual Representation Learning Work? [13.247759411409936]
We study contrastive self-supervised learning on four diverse large-scale datasets.
Our key findings include: (i) the benefit of additional pretraining data beyond 500k images is modest, (ii) adding pretraining images from another domain does not lead to more general representations, and (iii) corrupted pretraining images have a disparate impact on supervised and self-supervised pretraining.
arXiv Detail & Related papers (2021-05-12T17:52:42Z)
- A Simple Framework for Contrastive Learning of Visual Representations [116.37752766922407]
This paper presents SimCLR: a simple framework for contrastive learning of visual representations.
We show that composition of data augmentations plays a critical role in defining effective predictive tasks.
We are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet.
arXiv Detail & Related papers (2020-02-13T18:50:45Z)
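Several entries above (CALIP, MUST) and the zero-shot transfer setting in the SLIP abstract evaluate CLIP-style encoders by zero-shot classification. The sketch below illustrates the common recipe under stated assumptions: `image_encoder`, `text_encoder`, and `tokenizer` are hypothetical callables, and the single prompt template stands in for the prompt ensembling these papers typically use; it is not the exact evaluation code of any listed paper.

```python
# Minimal sketch of CLIP-style zero-shot classification (illustrative only).
# The encoders and tokenizer are assumed interfaces; prompt templates and
# ensembling details vary across the papers listed above.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, images, class_names,
                       template="a photo of a {}.", temperature=0.01):
    # Build one text embedding per class from a prompt template.
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)

    # Embed the images and score them against every class prompt.
    image_emb = F.normalize(image_encoder(images), dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # scaled cosine similarity
    return logits.argmax(dim=-1)                     # predicted class index per image
```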
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.