Related papers: Pre-training image-language transformers for open-vocabulary tasks

URL: http://arxiv.org/abs/2209.04372v1
Date: Fri, 9 Sep 2022 16:11:11 GMT
Title: Pre-training image-language transformers for open-vocabulary tasks
Authors: AJ Piergiovanni and Weicheng Kuo and Anelia Angelova
Abstract summary: We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
Score: 53.446599611203474
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Related papers

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks. Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones. This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our method consists of a multimodal transformer that encodes visual observations and language instructions. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
Vision-and-Language Pretraining [19.903012955284698]
This article provides a comprehensive revision of contemporary V&L pretraining models. In particular, we categorize and delineate pretraining approaches, along with the summary of state-of-the-art vision-and-language pretrained models.
arXiv Detail & Related papers (2022-07-05T02:18:49Z)
VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection. We propose to learn contextualized, joint representations through vision-language pre-training. The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation. We build a unified Transformer framework to jointly learn visual representation, and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.