E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual
Learning
- URL: http://arxiv.org/abs/2106.01804v2
- Date: Fri, 4 Jun 2021 06:56:48 GMT
- Title: E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual
Learning
- Authors: Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming
Xiao and Fei Huang
- Abstract summary: We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representation and semantic alignments between image and text.
- Score: 31.622393984150314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training (VLP) on large-scale image-text pairs has
achieved huge success on cross-modal downstream tasks. Most existing
pre-training methods adopt a two-step training procedure: they first employ a
pre-trained object detector to extract region-based visual features, then
concatenate the image representation and text embeddings as the input to a
Transformer for training. However, these methods suffer from two problems: the
task-specific visual representation of a particular object detector is
ill-suited to generic cross-modal understanding, and the two-stage pipeline is
computationally inefficient. In this paper, we propose the first end-to-end
vision-language
pre-trained model for both V+L understanding and generation, namely E2E-VLP,
where we build a unified Transformer framework to jointly learn visual
representation and semantic alignments between image and text. We incorporate
the tasks of object detection and image captioning into pre-training with a
unified Transformer encoder-decoder architecture for enhancing visual learning.
An extensive set of experiments has been conducted on well-established
vision-language downstream tasks to demonstrate the effectiveness of this novel
VLP paradigm.
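
To make the pre-training setup concrete, the following is a minimal PyTorch sketch of the kind of unified encoder-decoder the abstract describes: a shared Transformer encoder consumes image grid features together with text embeddings, and a decoder serves both a DETR-style object-detection head and an image-captioning head. All names, dimensions, and head choices (E2EVLPSketch, d_model=256, 100 object queries, etc.) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of an E2E-VLP-style unified encoder-decoder.
# All names, sizes, and heads are illustrative assumptions, not the authors'
# released code.
import torch
import torch.nn as nn


class E2EVLPSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, nhead=8,
                 num_layers=6, num_queries=100, num_classes=1601):
        super().__init__()
        # Small CNN stem producing grid features (stands in for a full backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Shared cross-modal encoder over [visual grid tokens; text tokens].
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        # One decoder reused by both pre-training branches.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        # Detection branch driven by learned object queries (DETR-style).
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h)
        # Captioning branch over decoded caption tokens.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids, caption_ids):
        # images: (B, 3, H, W); text_ids / caption_ids: (B, L) token ids.
        grid = self.backbone(images)                       # (B, D, h, w)
        grid = grid.flatten(2).transpose(1, 2)             # (B, h*w, D) visual tokens
        text = self.text_embed(text_ids)                   # (B, L, D)
        memory = self.encoder(torch.cat([grid, text], 1))  # joint cross-modal memory

        # Object detection: queries attend to the fused memory.
        queries = self.object_queries.unsqueeze(0).expand(images.size(0), -1, -1)
        det = self.decoder(queries, memory)
        class_logits = self.class_head(det)
        boxes = self.box_head(det).sigmoid()

        # Image captioning: causal decoding against the same memory.
        L = caption_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        cap = self.decoder(self.text_embed(caption_ids), memory, tgt_mask=causal)
        caption_logits = self.lm_head(cap)
        return class_logits, boxes, caption_logits


# Usage sketch with random inputs.
model = E2EVLPSketch()
outputs = model(torch.randn(2, 3, 224, 224),
                torch.randint(0, 30522, (2, 16)),
                torch.randint(0, 30522, (2, 12)))
```

In practice the detection and captioning losses would be combined with the usual image-text pre-training objectives; the sketch only shows how a single encoder-decoder can feed both kinds of heads.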
Related papers
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language
Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Pre-training image-language transformers for open-vocabulary tasks [53.446599611203474]
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model.
We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.
arXiv Detail & Related papers (2022-09-09T16:11:11Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model (see the sketch after this list).
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end QF framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)
- SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels [35.57369098866317]
Vision-language pre-training on large-scale image-text pairs has witnessed rapid progress for learning cross-modal representations.
We propose a new pre-training method which jointly aligns both the low-level and high-level semantics between image and text representations.
arXiv Detail & Related papers (2021-03-14T02:39:14Z)
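
As referenced in the FIBER entry above, here is a minimal, hypothetical sketch of what "fusion in the backbone" can look like: cross-attention blocks interleaved with the uni-modal image and text layers inside the backbones themselves, rather than a separate fusion Transformer stacked on top. The module names, layer choices, and dimensions are assumptions for illustration and do not reproduce FIBER's actual implementation.

```python
# Hypothetical sketch of fusion-in-the-backbone: each fused stage lets the
# image stream cross-attend to text tokens and vice versa, inside the
# backbones. Purely illustrative; not FIBER's released code.
import torch
import torch.nn as nn


class FusedBackboneStage(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # Uni-modal sub-layers (stand-ins for vision / language backbone blocks).
        self.img_self = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.txt_self = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Cross-attention inserted into the backbone for multimodal fusion.
        self.img2txt = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt2img = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm_img = nn.LayerNorm(d_model)
        self.norm_txt = nn.LayerNorm(d_model)

    def forward(self, img_tokens, txt_tokens):
        # Uni-modal processing first.
        img = self.img_self(img_tokens)
        txt = self.txt_self(txt_tokens)
        # Fusion: each modality queries the other and adds the result residually.
        img = self.norm_img(img + self.img2txt(img, txt, txt)[0])
        txt = self.norm_txt(txt + self.txt2img(txt, img, img)[0])
        return img, txt


# Usage: stack a few fused stages as the deeper part of the backbone.
stages = nn.ModuleList(FusedBackboneStage() for _ in range(4))
img, txt = torch.randn(2, 196, 256), torch.randn(2, 32, 256)
for stage in stages:
    img, txt = stage(img, txt)
```

In such a design the early backbone stages could remain purely uni-modal, with the fused stages switched on only at depth, which is the general idea behind pushing fusion into the backbone instead of appending a dedicated fusion module.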