Pre-training image-language transformers for open-vocabulary tasks
- URL: http://arxiv.org/abs/2209.04372v1
- Date: Fri, 9 Sep 2022 16:11:11 GMT
- Title: Pre-training image-language transformers for open-vocabulary tasks
- Authors: AJ Piergiovanni and Weicheng Kuo and Anelia Angelova
- Abstract summary: We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model.
We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
- Score: 53.446599611203474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a pre-training approach for vision and language transformer
models, which is based on a mixture of diverse tasks. We explore both the use
of image-text captioning data in pre-training, which does not need additional
supervision, as well as object-aware strategies to pre-train the model. We
evaluate the method on a number of text-generative vision+language tasks, such
as Visual Question Answering, visual entailment and captioning, and demonstrate
large gains over standard pre-training methods.
Related papers
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Vision-and-Language Pretraining [19.903012955284698]
This article provides a comprehensive review of contemporary V&L pretraining models.
In particular, we categorize and delineate pretraining approaches, along with a summary of state-of-the-art vision-and-language pretrained models.
arXiv Detail & Related papers (2022-07-05T02:18:49Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.