A Fistful of Words: Learning Transferable Visual Models from
Bag-of-Words Supervision
- URL: http://arxiv.org/abs/2112.13884v1
- Date: Mon, 27 Dec 2021 20:02:10 GMT
- Title: A Fistful of Words: Learning Transferable Visual Models from
Bag-of-Words Supervision
- Authors: Ajinkya Tejankar, Bichen Wu, Saining Xie, Madian Khabsa,
Hamed Pirsiavash, Hamed Firooz
- Abstract summary: In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset.
Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption.
- Score: 32.4697157553247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using natural language as supervision for training visual recognition
models holds great promise. Recent works have shown that if such supervision is
used in the form of alignment between images and captions in large training
datasets, then the resulting aligned models perform well on zero-shot
classification as a downstream task. In this paper, we focus on teasing out
what parts of the language supervision are essential for training zero-shot
image classification models. Through extensive and careful experiments, we show
that: 1) A simple Bag-of-Words (BoW) caption could be used as a replacement for
most of the image captions in the dataset. Surprisingly, we observe that this
approach improves the zero-shot classification performance when combined with
word balancing. 2) Using a BoW pretrained model, we can obtain more training
data by generating pseudo-BoW captions on images that do not have a caption.
Models trained on images with real and pseudo-BoW captions achieve stronger
zero-shot performance. On ImageNet-1k zero-shot evaluation, our best model,
which uses only 3M image-caption pairs, performs on par with a CLIP model
trained on 15M image-caption pairs (31.5% vs. 31.3%).
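The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it describes: converting a real caption into a frequency-balanced Bag-of-Words caption, and turning image-word scores from a BoW-pretrained model into a pseudo-BoW caption for an uncaptioned image. All names here (build_vocab, to_bow_caption, pseudo_bow_caption, min_count, max_words) are illustrative assumptions, not the paper's code.

```python
import random
import re
from collections import Counter


def build_vocab(captions, min_count=5):
    """Count word frequencies over the caption corpus and keep frequent words."""
    counts = Counter()
    for cap in captions:
        counts.update(re.findall(r"[a-z]+", cap.lower()))
    return {w: c for w, c in counts.items() if c >= min_count}


def to_bow_caption(caption, vocab, max_words=10, rng=random):
    """Replace a caption with an unordered set of its in-vocabulary words,
    sampled with weights inversely proportional to corpus frequency so that
    rare words are kept more often (one plausible form of word balancing)."""
    words = [w for w in set(re.findall(r"[a-z]+", caption.lower())) if w in vocab]
    if not words:
        return ""
    pool = {w: 1.0 / vocab[w] for w in words}
    chosen = []
    for _ in range(min(max_words, len(words))):  # weighted sampling w/o replacement
        items = list(pool)
        pick = rng.choices(items, weights=[pool[w] for w in items], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    rng.shuffle(chosen)
    return " ".join(chosen)


def pseudo_bow_caption(word_scores, k=10):
    """Given image-word similarity scores from a BoW-pretrained model
    (e.g. {"dog": 0.8, "grass": 0.5, ...}), keep the top-k words as a
    pseudo-BoW caption for an image that has no real caption."""
    return " ".join(sorted(word_scores, key=word_scores.get, reverse=True)[:k])
```

In this sketch, the resulting BoW strings would simply replace the original captions during contrastive pretraining, and pseudo-BoW captions would extend the training set to images that lack captions, as described in the abstract.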
Related papers
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z) - Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme for training the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Data Efficient Language-supervised Zero-shot Recognition with Optimal
Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image text pairs.
arXiv Detail & Related papers (2021-12-17T11:27:26Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - Scaling Up Vision-Language Pre-training for Image Captioning [51.639880603821446]
We present LEMON, a LargE-scale iMage captiONer.
We show LEMON achieves new state-of-the-art results on several major image captioning benchmarks.
arXiv Detail & Related papers (2021-11-24T02:30:22Z) - Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
arXiv Detail & Related papers (2021-02-26T19:04:58Z) - VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.