Scaling Up Vision-Language Pre-training for Image Captioning
- URL: http://arxiv.org/abs/2111.12233v1
- Date: Wed, 24 Nov 2021 02:30:22 GMT
- Title: Scaling Up Vision-Language Pre-training for Image Captioning
- Authors: Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao
Lu, Lijuan Wang
- Abstract summary: We present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.
We show that LEMON achieves new state-of-the-art results on several major image captioning benchmarks.
- Score: 51.639880603821446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, we have witnessed a significant performance boost in the image
captioning task based on vision-language pre-training (VLP). Scale is believed
to be an important factor for this advance. However, most existing work only
focuses on pre-training transformers with moderate sizes (e.g., 12 or 24
layers) on roughly 4 million images. In this paper, we present LEMON, a
LargE-scale iMage captiONer, and provide the first empirical study on the
scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL
model as our reference model, which consists of an image feature extractor and
a transformer model, and scale the transformer both up and down, with model
sizes ranging from 13 to 675 million parameters. In terms of data, we conduct
experiments with up to 200 million image-text pairs which are automatically
collected from the web based on the alt attribute of the image (dubbed ALT200M).
Extensive analysis helps to characterize the performance trend as the model
size and the pre-training data size increase. We also compare different
training recipes, especially for training on large-scale noisy data. As a
result, LEMON achieves new state-of-the-art results on several major image captioning
benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also
show LEMON can generate captions with long-tail visual concepts when used in a
zero-shot manner.
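As a rough illustration of the 13-to-675-million-parameter range studied in the paper, the sketch below estimates transformer parameter counts for a few hypothetical depth/width settings. The layer counts, hidden sizes, and vocabulary size are illustrative assumptions, not the actual LEMON configurations.

```python
# Rough transformer parameter-count estimate: 12 * L * d^2 for the attention
# and feed-forward blocks (biases and layer norms ignored) plus vocab * d for
# the token embeddings. The configurations below are hypothetical points that
# happen to span roughly 13M-675M parameters; they are not LEMON's settings.

def approx_params(num_layers: int, hidden: int, vocab: int = 30522) -> int:
    return 12 * num_layers * hidden ** 2 + vocab * hidden

configs = [
    ("tiny",   6,  256),
    ("base",  12,  768),
    ("large", 24, 1024),
    ("huge",  32, 1280),
]

for name, layers, hidden in configs:
    print(f"{name:>5}: ~{approx_params(layers, hidden) / 1e6:.0f}M parameters")
# tiny: ~13M, base: ~108M, large: ~333M, huge: ~668M
```

The ALT200M data are described only as image-text pairs collected automatically from the web based on the alt attribute of images. A minimal sketch of that kind of harvesting step is shown below; the length-based filter and the class name are hypothetical, not the paper's actual pipeline.

```python
# Minimal sketch of alt-attribute harvesting from a single HTML page.
from html.parser import HTMLParser
from urllib.parse import urljoin


class AltTextCollector(HTMLParser):
    """Collects (image_url, alt_text) pairs from <img> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src, alt = attrs.get("src"), (attrs.get("alt") or "").strip()
        # Keep only reasonably descriptive alt texts (made-up heuristic).
        if src and 3 <= len(alt.split()) <= 50:
            self.pairs.append((urljoin(self.base_url, src), alt))


page = '<img src="/cat.jpg" alt="a tabby cat sleeping on a windowsill">'
collector = AltTextCollector("https://example.com")
collector.feed(page)
print(collector.pairs)  # [('https://example.com/cat.jpg', 'a tabby cat ...')]
```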
Related papers
- Déjà Vu Memorization in Vision-Language Models [39.51189095703773]
We propose a new method for measuring memorization in Vision-Language Models (VLMs).
We show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption.
We evaluate déjà vu memorization at both the sample and population levels, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs.
arXiv Detail & Related papers (2024-02-03T09:55:35Z)
- The Solution for the CVPR2023 NICE Image Captioning Challenge [11.37047794237074]
We present our solution to the New frontiers for Zero-shot Image Captioning Challenge.
This challenge covers a large variety of new visual concepts from many domains.
At the data level, we collect external training data from Laion-5B.
At the model level, we use OFA, a large-scale vision-language pre-training model.
arXiv Detail & Related papers (2023-10-10T09:09:41Z)
- Image Captioners Are Scalable Vision Learners Too [61.98796478791261]
Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones.
Our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
arXiv Detail & Related papers (2023-06-13T17:18:01Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive amounts of data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision [32.4697157553247]
In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset.
Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption.
arXiv Detail & Related papers (2021-12-27T20:02:10Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)