Data-Efficient Language-Supervised Zero-Shot Learning with
Self-Distillation
- URL: http://arxiv.org/abs/2104.08945v1
- Date: Sun, 18 Apr 2021 19:55:31 GMT
- Title: Data-Efficient Language-Supervised Zero-Shot Learning with
Self-Distillation
- Authors: Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, Joseph E.
Gonzalez
- Abstract summary: Natural language has been shown to be a broader and richer source of supervision than supervised "gold" labels.
We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, a dataset 133x smaller than CLIP's.
- Score: 23.631184498984933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional computer vision models are trained to predict a fixed set of
predefined categories. Recently, natural language has been shown to be a
broader and richer source of supervision that provides finer descriptions to
visual concepts than supervised "gold" labels. Previous works, such as CLIP,
use a simple pretraining task of predicting the pairings between images and
text captions. CLIP, however, is data hungry and requires more than 400M
image-text pairs for training. We propose a data-efficient contrastive distillation
method that uses soft labels to learn from noisy image-text pairs. Our model
transfers knowledge from pretrained image and sentence encoders and achieves
strong performance with only 3M image-text pairs, a dataset 133x smaller than
CLIP's. Our method exceeds the previous SoTA for general zero-shot learning on
ImageNet 21k+1k by a relative 73% with a ResNet50 image encoder and a DeCLUTR
text encoder. We also beat CLIP by a relative 10.5% on zero-shot evaluation on
Google Open Images (19,958 classes).
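To make the training objective concrete, the sketch below shows one way to implement a soft-label contrastive distillation loss of the kind the abstract describes: the standard one-hot (diagonal) image-text matching targets are mixed with soft targets derived from frozen, pretrained teacher image and sentence encoders. Building the soft targets from intra-modal teacher similarities, the function name, and the alpha/temperature values are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive_loss(img_emb, txt_emb, t_img_emb, t_txt_emb,
                                temperature=0.07, alpha=0.5):
    """Contrastive loss over a batch of paired image/text embeddings, where
    the usual one-hot (diagonal) targets are mixed with soft targets taken
    from frozen teacher encoders. alpha and temperature are illustrative
    hyperparameters, not the paper's exact settings."""
    # L2-normalize student and teacher embeddings so dot products are cosines.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    t_img_emb = F.normalize(t_img_emb, dim=-1)
    t_txt_emb = F.normalize(t_txt_emb, dim=-1)

    n = img_emb.size(0)
    logits = img_emb @ txt_emb.t() / temperature  # student image-to-text logits

    # Hard targets: the i-th image goes with the i-th caption.
    hard = torch.eye(n, device=logits.device)
    # Soft targets: one reasonable construction from intra-modal teacher
    # similarities (image-image from the image teacher, text-text from the
    # sentence teacher), combined symmetrically and softmaxed row-wise.
    with torch.no_grad():
        sim_ii = t_img_emb @ t_img_emb.t()
        sim_tt = t_txt_emb @ t_txt_emb.t()
        soft = F.softmax(0.5 * (sim_ii + sim_tt) / temperature, dim=-1)
    targets = alpha * hard + (1 - alpha) * soft

    # Cross-entropy against the mixed targets, in both retrieval directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=-1)).sum(-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The soft targets let captions that plausibly describe several images in the batch share probability mass, which is how noisy web-crawled pairs can be used without the strict one-to-one matching assumption.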
Related papers
- FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training [21.372374962328948]
Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders.
We propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders.
FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy.
arXiv Detail & Related papers (2024-11-18T09:19:30Z)
- Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images [8.612889476601822]
We present MI-Zero, a framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images.
MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images.
arXiv Detail & Related papers (2023-06-13T15:05:24Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [68.84339672878066]
We introduce PyramidCLIP, which constructs an input pyramid with different semantic levels and aligns visual and linguistic elements hierarchically.
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
arXiv Detail & Related papers (2022-04-29T13:38:42Z)
- Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find soft image-text matches as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs; a generic sketch of the soft-matching step appears after this list.
arXiv Detail & Related papers (2021-12-17T11:27:26Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate CLIP's reliance on prohibitively large amounts of image-text data.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
- Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
arXiv Detail & Related papers (2021-02-26T19:04:58Z)
- VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining [39.24803665848558]
We propose VisualGPT, a data-efficient image captioning model that leverages the linguistic knowledge from a large pretrained language model (LM).
We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data.
VisualGPT outperforms the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions.
arXiv Detail & Related papers (2021-02-20T18:02:42Z)
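As referenced in the OTTER entry above, one way to obtain soft image-text matches for contrastive learning is to run a few Sinkhorn iterations of entropic optimal transport on a batch similarity matrix. The sketch below assumes uniform marginals, and the values of epsilon and n_iters are illustrative; OTTER's exact online formulation may differ.

```python
import torch

def sinkhorn_soft_labels(sim, epsilon=0.05, n_iters=5):
    """Turn a float batch similarity matrix (e.g. cosine similarities between
    teacher-derived image and text representations) into an approximate
    transport plan via Sinkhorn iterations, then row-normalize it so each row
    can be used as a soft label distribution."""
    K = torch.exp(sim / epsilon)                     # entropic OT kernel, (n, n)
    n = K.size(0)
    u = torch.full((n,), 1.0 / n, device=K.device)   # target row marginals
    v = torch.full((n,), 1.0 / n, device=K.device)   # target column marginals
    r = torch.ones(n, device=K.device)
    c = torch.ones(n, device=K.device)
    for _ in range(n_iters):
        r = u / (K @ c)          # scale rows toward the uniform marginal
        c = v / (K.t() @ r)      # scale columns toward the uniform marginal
    plan = r[:, None] * K * c[None, :]               # approximate transport plan
    return plan / plan.sum(dim=-1, keepdim=True)     # rows sum to 1
```

The row-normalized plan can then stand in for the teacher softmax in a soft-target contrastive loss such as the one sketched after the abstract above.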
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.