VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and
Linguistic Knowledge from Pretraining
- URL: http://arxiv.org/abs/2102.10407v1
- Date: Sat, 20 Feb 2021 18:02:42 GMT
- Title: VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and
Linguistic Knowledge from Pretraining
- Authors: Jun Chen, Han Guo, Kai Yi, Boyang Li, Mohamed Elhoseiny
- Abstract summary: We propose VisualGPT, a data-efficient image captioning model that leverages the linguistic knowledge from a large pretrained language model (LM).
We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data.
VisualGPT outperforms the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions.
- Score: 39.24803665848558
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we aim to improve the data efficiency of image captioning. We
propose VisualGPT, a data-efficient image captioning model that leverages the
linguistic knowledge from a large pretrained language model (LM). A crucial
challenge is to balance between the use of visual information in the image and
prior linguistic knowledge acquired from pretraining. We designed a novel
self-resurrecting encoder-decoder attention mechanism to quickly adapt the
pretrained LM as the language decoder on a small amount of in-domain training
data. The proposed self-resurrecting activation unit produces sparse
activations but is not susceptible to zero gradients. When trained on 0.1%,
0.5% and 1% of MS COCO and Conceptual Captions, the proposed model, VisualGPT,
surpasses strong image captioning baselines. VisualGPT outperforms the best
baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on
Conceptual Captions. We also perform a series of ablation studies to quantify
the utility of each system component. To the best of our knowledge, this is the
first work that improves data efficiency of image captioning by utilizing LM
pretrained on unimodal data. Our code is available at:
https://github.com/Vision-CAIR/VisualGPT.
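The self-resurrecting gating idea described in the abstract can be pictured with a short sketch. This is a minimal illustration under assumptions: the gate projection, threshold value, and tensor shapes are illustrative and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SelfResurrectingGate(nn.Module):
    """Sketch of the self-resurrecting activation idea: complementary gates
    decide how much each position relies on the visual attention output vs.
    the pretrained LM stream. Gate values below tau are zeroed, giving sparse
    activations, but because the gate is recomputed from the hidden state at
    every forward pass, a zeroed gate can become active again later instead of
    being stuck at zero."""

    def __init__(self, d_model: int, tau: float = 0.2):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # illustrative gate projection
        self.tau = tau

    def forward(self, hidden, visual_ctx, language_ctx):
        # hidden, visual_ctx, language_ctx: (batch, seq_len, d_model)
        g = torch.sigmoid(self.proj(hidden))                 # gate in (0, 1)
        b_vis = g * (g > self.tau).float()                   # sparse visual gate
        b_lan = (1.0 - g) * ((1.0 - g) > self.tau).float()   # complementary LM gate
        return b_vis * visual_ctx + b_lan * language_ctx
```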
Related papers
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
Long text understanding is of great demands in language-image pre-training models.
We relabel the data with long captions; however, directly learning from them may degrade performance on short-text understanding.
We validate the effectiveness of our approach using a self-constructed large-scale dataset.
Notably, on long-text image retrieval, our method outperforms the competing approach that uses long captions by 11.1%.
arXiv Detail & Related papers (2024-10-07T17:52:56Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), a self-training approach aimed specifically at image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
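The summary above describes self-training on self-constructed preference pairs over image descriptions. A minimal, hedged way to consume such pairs is a DPO-style objective over summed token log-probabilities; treating this exact loss (and the function below) as the paper's implementation is an assumption.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(logp_chosen: torch.Tensor,
                        logp_rejected: torch.Tensor,
                        ref_logp_chosen: torch.Tensor,
                        ref_logp_rejected: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss on preference pairs of image descriptions.
    Each input is the summed token log-probability of the preferred
    (chosen) or dispreferred (rejected) description under the current
    policy or a frozen reference model, shape (batch,)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```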
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
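A minimal sketch of the bi-path supervision, assuming one InfoNCE term against the raw web text and one against the synthetic caption; ALIP reportedly weights samples adaptively, so the fixed scalar weights and temperature below are simplifications rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bi_path_contrastive_loss(img, raw_txt, syn_txt,
                             w_raw=0.5, w_syn=0.5, temperature=0.07):
    """Combine two symmetric InfoNCE losses: image vs. raw web text and
    image vs. synthetic caption. Inputs are L2-normalized (batch, dim)."""
    def info_nce(a, b):
        logits = a @ b.t() / temperature
        labels = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))
    return w_raw * info_nce(img, raw_txt) + w_syn * info_nce(img, syn_txt)
```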
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
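For orientation, the simplest way to reuse an image-text model for video recognition is to pool frame embeddings over time and score them against class-name text embeddings; the paper's method adds video-specific components, so the pooling baseline below is an illustrative assumption rather than their approach.

```python
import torch

@torch.no_grad()
def zero_shot_video_logits(frame_emb, class_text_emb, temperature=0.01):
    """Naive adaptation baseline: mean-pool per-frame image embeddings into a
    video embedding and compare it with text embeddings of class prompts.
    frame_emb: (batch, num_frames, dim); class_text_emb: (num_classes, dim);
    both assumed L2-normalized."""
    video_emb = frame_emb.mean(dim=1)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    return video_emb @ class_text_emb.t() / temperature
```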
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
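The captioner-plus-filter bootstrapping can be summarized in a few lines. The `captioner(image) -> str` and `filter_fn(image, text) -> float` interfaces and the threshold are assumed for illustration and are not BLIP's actual API.

```python
def capfilt(web_pairs, captioner, filter_fn, threshold=0.5):
    """Bootstrap a cleaner pretraining set from noisy web image-text pairs:
    keep a web caption only if the filter judges it to match the image,
    generate a synthetic caption for every image, and keep that too only
    if it passes the same filter."""
    cleaned = []
    for image, web_text in web_pairs:
        if filter_fn(image, web_text) > threshold:    # keep matching web text
            cleaned.append((image, web_text))
        synthetic = captioner(image)                  # captioner writes a new caption
        if filter_fn(image, synthetic) > threshold:   # keep it only if it matches
            cleaned.append((image, synthetic))
    return cleaned
```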
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
- Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs.
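A hedged sketch of using optimal transport for soft labels: a few Sinkhorn normalization steps over the in-batch similarity matrix yield a transport plan that replaces the usual one-hot contrastive targets. The iteration count and entropic regularization below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def sinkhorn_soft_targets(sim, n_iters=5, eps=0.05):
    """Turn an in-batch image-text similarity matrix (batch, batch) into
    soft matching targets via Sinkhorn-style row/column normalization."""
    q = torch.exp(sim / eps)
    for _ in range(n_iters):
        q = q / q.sum(dim=1, keepdim=True)   # normalize rows
        q = q / q.sum(dim=0, keepdim=True)   # normalize columns
    return q / q.sum(dim=1, keepdim=True)    # each row is a target distribution

def soft_contrastive_loss(logits, soft_targets):
    """Cross-entropy against soft (non one-hot) image-to-text targets."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```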
arXiv Detail & Related papers (2021-12-17T11:27:26Z)
- Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation [23.631184498984933]
Natural language has been shown to be a broader and richer source of supervision than supervised "gold" labels.
We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, a training set 133x smaller than CLIP's.
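One way to realize soft-label contrastive distillation is to match the student's in-batch logits to a softened distribution produced by the frozen pretrained encoders; the temperature and KL formulation below are assumptions for illustration only.

```python
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, tau=0.1):
    """Distill soft image-text matching targets: the frozen teacher encoders
    define a distribution over in-batch pairs, and the student's contrastive
    logits are trained to match it with a KL divergence."""
    teacher_probs = F.softmax(teacher_logits / tau, dim=1)
    student_logp = F.log_softmax(student_logits, dim=1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```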
arXiv Detail & Related papers (2021-04-18T19:55:31Z)
- Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
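The pretraining task of predicting which caption goes with which image is commonly implemented as a symmetric cross-entropy over in-batch image-text pairs, as in the sketch below; the embedding normalization and logit scale are assumed.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric contrastive loss: each image should match its own caption
    and vice versa within the batch. Embeddings are L2-normalized (batch, dim)."""
    logits = logit_scale * image_emb @ text_emb.t()
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```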
arXiv Detail & Related papers (2021-02-26T19:04:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.