Data Efficient Language-supervised Zero-shot Recognition with Optimal
Transport Distillation
- URL: http://arxiv.org/abs/2112.09445v3
- Date: Sun, 17 Dec 2023 19:47:04 GMT
- Title: Data Efficient Language-supervised Zero-shot Recognition with Optimal
Transport Distillation
- Authors: Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Tianren Gao, Peter Vajda,
Joseph E. Gonzalez
- Abstract summary: We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs.
- Score: 43.03533959429743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional computer vision models are trained to predict a fixed set of
predefined categories. Recently, natural language has been shown to be a
broader and richer source of supervision that provides finer descriptions to
visual concepts than supervised "gold" labels. Previous works, such as CLIP,
use InfoNCE loss to train a model to predict the pairing between images and
text captions. CLIP, however, is data-hungry and requires more than 400M
image-text pairs for training. The inefficiency can be partially attributed to
the fact that the image-text pairs are noisy. To address this, we propose OTTER
(Optimal TransporT distillation for Efficient zero-shot Recognition), which
uses online entropic optimal transport to find a soft image-text match as
labels for contrastive learning. Based on pretrained image and text encoders,
models trained with OTTER achieve strong performance with only 3M image-text
pairs. Compared with InfoNCE loss, label smoothing, and knowledge distillation,
OTTER consistently outperforms these baselines in zero-shot evaluation on
Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032
classes) from Tencent ML-Images. Across 42 evaluations spanning 7
dataset/architecture settings x 6 metrics, OTTER outperforms all baselines in
32 and ties in 2, matching or exceeding them in 34 of 42.
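To make the mechanism concrete, here is a minimal Python sketch of the idea described above (not the authors' released code): a few Sinkhorn iterations over the in-batch similarity matrix yield an entropic optimal-transport plan that serves as a soft image-text matching target, blended with the one-hot targets that plain InfoNCE would use. The function names, the blending weight alpha, the entropic regularizer eps, and the use of the model's own similarities (OTTER instead distills matching scores from pretrained encoders) are illustrative assumptions.

import torch
import torch.nn.functional as F

def sinkhorn_soft_targets(sim, eps=0.05, iters=3):
    # Approximate entropic optimal transport over a batch similarity matrix.
    # sim: (N, N) image-text cosine similarities. Returns an (N, N) plan whose
    # rows sum to 1, usable as soft matching targets.
    Q = torch.exp(sim / eps)                  # kernelized similarities
    for _ in range(iters):                    # alternate row/column normalization
        Q = Q / Q.sum(dim=1, keepdim=True)    # rows sum to 1
        Q = Q / Q.sum(dim=0, keepdim=True)    # columns sum to 1
    return Q / Q.sum(dim=1, keepdim=True)     # re-normalize rows for use as targets

def contrastive_loss_with_ot_targets(img_emb, txt_emb, temperature=0.07, alpha=0.5):
    # Contrastive loss where the one-hot InfoNCE targets are mixed with OT soft targets.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()               # (N, N) cosine similarities in [-1, 1]
    logits = sim / temperature

    with torch.no_grad():                     # targets are held fixed during backprop
        soft = sinkhorn_soft_targets(sim)
        hard = torch.eye(sim.size(0), device=sim.device)
        targets = alpha * hard + (1.0 - alpha) * soft

    # symmetric cross-entropy over image-to-text and text-to-image directions
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

Computing the transport plan under torch.no_grad() keeps the soft targets fixed during backpropagation, mirroring how distillation-style targets are normally treated.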
Related papers
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [68.84339672878066]
We introduce PyramidCLIP, which constructs an input pyramid with different semantic levels, and aligns visual elements and linguistic elements in the form of hierarchy.
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
arXiv Detail & Related papers (2022-04-29T13:38:42Z)
- A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision [32.4697157553247]
In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset.
Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption.
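As a loose illustration of the bag-of-words idea summarized above (a sketch under simplifying assumptions, not the paper's pipeline), one can drop word order and stop words from an ordinary caption; the stop-word list and whitespace tokenization below are placeholders.

# Minimal sketch: turn a caption into a bag-of-words "caption" by keeping the
# unordered set of content words. Stop-word list and tokenization are assumptions.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is", "are"}

def to_bow_caption(caption: str) -> str:
    words = [w.strip(".,!?").lower() for w in caption.split()]
    content = sorted({w for w in words if w and w not in STOP_WORDS})
    return " ".join(content)

print(to_bow_caption("A dog is playing with a red ball on the grass."))
# -> "ball dog grass playing red"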
arXiv Detail & Related papers (2021-12-27T20:02:10Z)
- Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation [23.631184498984933]
Natural language has been shown to be a broader and richer source of supervision than supervised "gold" labels.
We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x fewer than CLIP.
arXiv Detail & Related papers (2021-04-18T19:55:31Z)
- Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
arXiv Detail & Related papers (2021-02-26T19:04:58Z)
- VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining [39.24803665848558]
We propose VisualGPT, a data-efficient image captioning model that leverages the linguistic knowledge from a large pretrained language model (LM)
We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data.
VisualGPT outperforms the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions.
arXiv Detail & Related papers (2021-02-20T18:02:42Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)