Supervision Exists Everywhere: A Data Efficient Contrastive
Language-Image Pre-training Paradigm
- URL: http://arxiv.org/abs/2110.05208v1
- Date: Mon, 11 Oct 2021 12:17:32 GMT
- Title: Supervision Exists Everywhere: A Data Efficient Contrastive
Language-Image Pre-training Paradigm
- Authors: Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing
Shao, Fengwei Yu, Junjie Yan
- Abstract summary: Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
- Score: 109.0573737034428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has
attracted unprecedented attention for its impressive zero-shot recognition
ability and excellent transferability to downstream tasks. However, CLIP is
quite data-hungry and requires 400M image-text pairs for pre-training, thereby
restricting its adoption. This work proposes a novel training paradigm, Data
efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by
carefully utilizing the widespread supervision among the image-text pairs, our
DeCLIP can learn generic visual features more efficiently. Instead of using
the single image-text contrastive supervision, we fully exploit data potential
through the use of (1) self-supervision within each modality; (2) multi-view
supervision across modalities; (3) nearest-neighbor supervision from other
similar pairs. Benefiting from intrinsic supervision, our DeCLIP-ResNet50 can
achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the
CLIP-ResNet50 while using 7.1x fewer data. Our DeCLIP-ResNet50 outperforms its
counterpart in 8 out of 11 visual datasets when transferred to downstream
tasks. Moreover, scaling up the model and computing also works well in our
framework. Our code, dataset and models are released at:
https://github.com/Sense-GVT/DeCLIP
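As a rough illustration of the supervision signals described in the abstract, the sketch below combines a CLIP-style InfoNCE loss with (1) within-modality self-supervision, (2) multi-view cross-modal terms, and (3) a nearest-neighbor term. This is a minimal sketch, not the released DeCLIP code: the loss weights, tensor shapes, and the use of a plain view-to-view InfoNCE stand-in for the paper's own self-supervised objectives are illustrative assumptions.

```python
# Illustrative sketch (not the official DeCLIP implementation) of a
# multi-supervision contrastive objective; names and weights are hypothetical.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (positives on the diagonal)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def declip_style_loss(img_v1, img_v2, txt_v1, txt_v2, queue):
    """img_v1/img_v2: embeddings of two augmented views of the same images.
    txt_v1/txt_v2: embeddings of two text views (e.g. original and augmented captions).
    queue: a memory bank of past text embeddings used for nearest-neighbor lookup.
    """
    # (0) Standard CLIP-style image-text alignment.
    l_clip = info_nce(img_v1, txt_v1)

    # (1) Self-supervision within each modality (the paper uses dedicated
    #     image/text self-supervised objectives; a view-to-view InfoNCE stands in here).
    l_self = info_nce(img_v1, img_v2) + info_nce(txt_v1, txt_v2)

    # (2) Multi-view supervision across modalities: every image view is
    #     matched against every text view.
    l_multiview = (info_nce(img_v1, txt_v2) +
                   info_nce(img_v2, txt_v1) +
                   info_nce(img_v2, txt_v2))

    # (3) Nearest-neighbor supervision: pull each image toward the text
    #     embedding in the queue closest to its own caption.
    sims = F.normalize(txt_v1, dim=-1) @ F.normalize(queue, dim=-1).t()
    nn_txt = queue[sims.argmax(dim=-1)]
    l_nn = info_nce(img_v1, nn_txt)

    # Loss weights here are arbitrary placeholders, not the paper's coefficients.
    return l_clip + 0.5 * l_self + 0.5 * l_multiview + 0.5 * l_nn


if __name__ == "__main__":
    B, D, Q = 8, 512, 1024  # toy batch size, embedding dim, queue size
    views = [torch.randn(B, D) for _ in range(4)]
    queue = torch.randn(Q, D)
    print(declip_style_loss(*views, queue).item())
```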
Related papers
- Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity [11.414069074535007]
Contrastive Language-Image Pre-training on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization.
Which small subsets of the training data provably generalize best has remained an open question.
We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve superior generalization performance.
arXiv Detail & Related papers (2024-03-18T21:32:58Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
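Several entries above quote ImageNet zero-shot accuracy. For context, a hedged sketch of the standard CLIP-style zero-shot classification protocol is given below; it is not any paper's official evaluation code, and image_encoder, text_encoder, and tokenize are hypothetical stand-ins for a pretrained model's components, with an illustrative prompt template.

```python
# Hedged illustration of CLIP-style zero-shot top-1 evaluation; the encoder
# and tokenizer arguments are hypothetical stand-ins for a pretrained model.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_top1(image_encoder, text_encoder, tokenize, images, labels, class_names):
    # Build one text embedding per class from a prompt template.
    prompts = tokenize([f"a photo of a {name}" for name in class_names])
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # [C, D]

    # Embed the images and classify by cosine similarity to the class prompts.
    img_emb = F.normalize(image_encoder(images), dim=-1)    # [N, D]
    pred = (img_emb @ text_emb.t()).argmax(dim=-1)          # [N]
    return (pred == labels).float().mean().item()
```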
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.