Image Captioners Are Scalable Vision Learners Too
- URL: http://arxiv.org/abs/2306.07915v5
- Date: Thu, 21 Dec 2023 18:24:08 GMT
- Title: Image Captioners Are Scalable Vision Learners Too
- Authors: Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil
Houlsby, Lucas Beyer
- Abstract summary: Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones.
Our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
- Score: 61.98796478791261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive pretraining on image-text pairs from the web is one of the most
popular large-scale pretraining strategies for vision backbones, especially in
the context of large multimodal models. At the same time, image captioning on
this type of data is commonly considered an inferior pretraining strategy. In
this paper, we perform a fair comparison of these two pretraining strategies,
carefully matching training data, compute, and model capacity. Using a standard
encoder-decoder transformer, we find that captioning alone is surprisingly
effective: on classification tasks, captioning produces vision encoders
competitive with contrastively pretrained encoders, while surpassing them on
vision & language tasks. We further analyze the effect of the model
architecture and scale, as well as the pretraining data, on the representation
quality, and find that captioning exhibits the same or better scaling behavior
along these axes. Overall, our results show that plain image captioning is a
more powerful pretraining strategy than was previously believed.
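As a concrete (and heavily simplified) illustration of the two objectives compared in the abstract, the PyTorch sketch below pairs one toy vision encoder with (a) a CLIP-style contrastive loss and (b) an autoregressive captioning decoder. All module names, dimensions, and the fake patchify step are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): the two pretraining objectives being
# compared, applied on top of the same vision encoder. Shapes are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionEncoder(nn.Module):
    """Stand-in for a ViT backbone: maps patchified images to patch embeddings."""
    def __init__(self, patch_dim=3 * 16 * 16, dim=256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)

    def forward(self, patches):          # patches: (B, N, patch_dim)
        return self.proj(patches)        # (B, N, dim)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over pooled image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                   # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

class CaptioningHead(nn.Module):
    """Autoregressive text decoder cross-attending to the encoder's patch tokens."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patch_emb, caption_ids):              # caption_ids: (B, T)
        tgt = self.embed(caption_ids[:, :-1])                # teacher forcing
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, patch_emb, tgt_mask=causal)
        logits = self.lm_head(hidden)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))

# Tiny smoke test with random data (batch of 8, 16 patches, captions of length 12).
if __name__ == "__main__":
    enc = ToyVisionEncoder()
    patch_emb = enc(torch.randn(8, 16, 3 * 16 * 16))
    captions = torch.randint(0, 1000, (8, 12))
    cap_loss = CaptioningHead()(patch_emb, captions)                      # captioning objective
    con_loss = contrastive_loss(patch_emb.mean(1), torch.randn(8, 256))   # contrastive objective
    print(float(cap_loss), float(con_loss))
```

In both setups, only the vision encoder would be kept for downstream evaluation, which is the comparison the abstract describes.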
Related papers
- Bidirectional Captioning for Clinically Accurate and Interpretable Models [4.355562946859011]
Vision-language pretraining has been shown to produce high-quality visual encoders which transfer efficiently to downstream computer vision tasks.
In this paper, we experiment with bidirectional captioning of radiology reports as a form of pretraining and compare the quality and utility of learned embeddings with those from contrastive pretraining methods.
Results show that not only does captioning pretraining yield visual encoders that are competitive with contrastive pretraining (CheXpert competition multi-label AUC of 89.4%), but also that our transformer decoder is capable of generating clinically relevant reports.
arXiv Detail & Related papers (2023-10-30T15:25:29Z)
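A minimal sketch of the bidirectional-captioning idea from the entry above: the same decoder is trained to caption both left-to-right and right-to-left over shared image features. This is a generic illustration, not the paper's implementation; `decoder` is assumed to behave like the CaptioningHead sketched earlier on this page.

```python
# Hedged sketch of bidirectional captioning pretraining: average the forward
# and backward next-token prediction losses over the same image features.
import torch

def bidirectional_caption_loss(decoder, patch_emb, caption_ids):
    """`decoder(patch_emb, ids)` is assumed to return a scalar LM loss."""
    forward_loss = decoder(patch_emb, caption_ids)
    backward_loss = decoder(patch_emb, torch.flip(caption_ids, dims=[1]))
    return 0.5 * (forward_loss + backward_loss)
```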
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
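A hedged sketch of the recaptioning recipe described in the entry above: replace nondescript web alt-text with captions from a generative model. `captioner` and the `is_nondescript` heuristic are hypothetical stand-ins, not DataComp or paper APIs.

```python
# Toy pipeline: substitute weak alt-text with a synthetic caption.
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str
    alt_text: str

def is_nondescript(text: str) -> bool:
    # Toy heuristic: very short or boilerplate alt-text carries little signal.
    return len(text.split()) < 3 or text.lower() in {"image", "photo", "img"}

def recaption(examples, captioner):
    """Return a new list where nondescript alt-text is replaced by generated captions."""
    out = []
    for ex in examples:
        text = captioner(ex.image_path) if is_nondescript(ex.alt_text) else ex.alt_text
        out.append(Example(ex.image_path, text))
    return out
```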
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
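A brief sketch of COSA-style sample concatenation as described in the entry above: group several image-text pairs into one pseudo video-paragraph example. The grouping and field layout are illustrative assumptions, not COSA's actual data format.

```python
# Toy data transformation: k image-text pairs -> one (frame sequence, paragraph) sample.
def concatenate_samples(pairs, k=4):
    """pairs: list of (image, caption). Yields (frames, paragraph) pseudo-videos."""
    for i in range(0, len(pairs) - k + 1, k):
        chunk = pairs[i:i + k]
        frames = [img for img, _ in chunk]              # treated as video frames
        paragraph = " ".join(cap for _, cap in chunk)   # treated as a paragraph
        yield frames, paragraph
```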
- Cross-Modal Similarity-Based Curriculum Learning for Image Captioning [46.18855398491187]
We propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model.
Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and competitive convergence speed compared to baselines.
arXiv Detail & Related papers (2022-12-14T07:52:36Z)
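A small sketch of the curriculum idea in the entry above: score each image-caption pair with a pretrained vision-language model and present training samples from easy (high similarity) to hard (low similarity). `clip_similarity` is a hypothetical scoring hook, not the paper's code.

```python
# Toy curriculum ordering by cross-modal similarity.
def curriculum_order(samples, clip_similarity):
    """samples: list of (image, caption); returns them sorted easy -> hard."""
    scored = [(clip_similarity(img, cap), img, cap) for img, cap in samples]
    scored.sort(key=lambda t: t[0], reverse=True)       # most similar (easiest) first
    return [(img, cap) for _, img, cap in scored]
```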
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
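A hedged sketch in the spirit of ViTCAP's Concept Token Network described above: a multi-label classifier over pooled ViT tokens predicts concepts, and their embeddings are appended to the visual features the caption decoder attends to. Dimensions, pooling, and the top-k selection are illustrative choices, not the paper's implementation.

```python
# Toy concept-prediction head: classify concepts, then append concept embeddings
# to the patch tokens so a caption decoder can attend over both.
import torch
import torch.nn as nn

class ConceptTokenHead(nn.Module):
    def __init__(self, dim=256, num_concepts=500, top_k=8):
        super().__init__()
        self.classifier = nn.Linear(dim, num_concepts)      # multi-label concept logits
        self.concept_embed = nn.Embedding(num_concepts, dim)
        self.top_k = top_k

    def forward(self, patch_emb):                           # patch_emb: (B, N, dim)
        logits = self.classifier(patch_emb.mean(dim=1))     # pool tokens, then classify
        top = logits.topk(self.top_k, dim=-1).indices       # predicted concept ids
        concept_tokens = self.concept_embed(top)            # (B, top_k, dim)
        # A caption decoder would cross-attend over [patch tokens ; concept tokens].
        return torch.cat([patch_emb, concept_tokens], dim=1), logits
```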
- Scaling Up Vision-Language Pre-training for Image Captioning [51.639880603821446]
We present LEMON, a LargE-scale iMage captiONer.
We show that LEMON achieves new state-of-the-art results on several major image captioning benchmarks.
arXiv Detail & Related papers (2021-11-24T02:30:22Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO), which performs pretraining in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)