A Survey of Vision-Language Pre-Trained Models
- URL: http://arxiv.org/abs/2202.10936v1
- Date: Fri, 18 Feb 2022 15:15:46 GMT
- Title: A Survey of Vision-Language Pre-Trained Models
- Authors: Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao
- Abstract summary: Pre-trained models have advanced at a breakneck pace in recent years.
How to adapt pre-training to the field of Vision-and-Language learning and improve performance on downstream tasks has become a focus of multimodal learning.
- Score: 41.323956143107644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the Transformer architecture has evolved, pre-trained models have advanced at a breakneck pace in recent years. They have come to dominate the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to the field of Vision-and-Language (V-L) learning and improve performance on downstream tasks has become a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude the paper and present some promising research directions. Our survey aims to provide multimodal researchers with a synthesis of, and pointers to, related research.
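Before pre-training, each modality is first mapped to a sequence of embeddings. As a rough illustration of what such single-modal encoders look like, here is a minimal sketch assuming a ViT-style patch projection for images and an embedding table for text tokens; all module names, dimensions, and the vocabulary size are illustrative assumptions, not the survey's own code.

    # Minimal sketch of single-modal embedding, assuming a ViT-style patch
    # projection for images and a lookup-table embedding for text tokens.
    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Split an image into patches and project each to an embedding."""
        def __init__(self, patch_size=16, in_ch=3, dim=768):
            super().__init__()
            # A strided convolution is equivalent to "flatten each patch,
            # then apply a shared linear layer".
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                      # x: (B, 3, 224, 224)
            x = self.proj(x)                       # (B, dim, 14, 14)
            return x.flatten(2).transpose(1, 2)    # (B, 196, dim)

    patch_embed = PatchEmbed()
    token_embed = nn.Embedding(30522, 768)         # vocab size is illustrative

    images = torch.randn(2, 3, 224, 224)
    tokens = torch.randint(0, 30522, (2, 32))
    img_emb = patch_embed(images)                  # (2, 196, 768)
    txt_emb = token_embed(tokens)                  # (2, 32, 768)

Both outputs are token sequences of the same width, which is what lets the architectures surveyed below mix the two modalities in one Transformer.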
Related papers
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce the Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective (sketched after this entry).
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
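To make the "straightforward auto-regressive objective" concrete, here is a minimal sketch of next-token prediction over a single sequence that concatenates discrete image tokens and text tokens. The image tokenizer and the tiny stand-in model are assumptions for illustration, not VL-GPT's actual components.

    # Hedged sketch: unified next-token prediction over concatenated image
    # and text tokens. VL-GPT's real image tokenizer and transformer are
    # replaced by toy stand-ins.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB = 1000                                      # illustrative joint vocabulary
    model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

    image_tokens = torch.randint(0, VOCAB, (2, 16))   # from a hypothetical image tokenizer
    text_tokens = torch.randint(0, VOCAB, (2, 8))
    seq = torch.cat([image_tokens, text_tokens], dim=1)  # (2, 24)

    logits = model(seq[:, :-1])                       # predict position t+1 from position t
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
    loss.backward()

Because image and text tokens share one vocabulary and one loss, the same model both "perceives" (conditions on) and generates either modality.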
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from the literature, and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)
- Vision-and-Language Pretraining [19.903012955284698]
This article provides a comprehensive review of contemporary V&L pretraining models.
In particular, we categorize and delineate pretraining approaches, along with a summary of state-of-the-art vision-and-language pretrained models.
arXiv Detail & Related papers (2022-07-05T02:18:49Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images (the shared masked-prediction recipe is sketched after this entry).
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
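All three masked objectives above share one recipe: corrupt a random subset of input tokens and train the model to recover only the corrupted positions. Below is a minimal sketch of that recipe for text tokens; the mask rate, mask id, and stand-in model are illustrative assumptions, and VL-BEiT applies the same idea to image patches and image-text pairs.

    # Hedged sketch of the masked-prediction recipe (here for text tokens).
    # Vocabulary size, mask id, mask rate, and the model are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, MASK_ID, MASK_RATE = 30522, 103, 0.15
    model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

    tokens = torch.randint(0, VOCAB, (2, 32))
    mask = torch.rand(tokens.shape) < MASK_RATE    # choose ~15% of positions
    corrupted = tokens.masked_fill(mask, MASK_ID)  # replace them with [MASK]

    logits = model(corrupted)                      # (2, 32, VOCAB)
    # Cross-entropy only on the masked positions; all others are ignored.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    loss.backward()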
- Vision-and-Language Pretrained Models: A Survey [3.270244666687303]
We present an overview of the major advances achieved in Visual-Language Pretrained Models.
We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content.
arXiv Detail & Related papers (2022-04-15T07:33:06Z)
- Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from a historical perspective.
We summarize the development of this field into three periods: task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach (sketched after this entry), then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
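The retrieval step can be pictured as a nearest-neighbour search in a shared embedding space: each image is paired with the text whose embedding it is most similar to. A minimal sketch follows, with random vectors standing in for real single-modal encoder outputs; it is not the paper's actual pipeline.

    # Hedged sketch: build a weakly aligned image-text corpus by cosine
    # similarity retrieval. Real single-modal encoders are assumed; random
    # vectors stand in for their outputs here.
    import torch
    import torch.nn.functional as F

    img_emb = F.normalize(torch.randn(100, 256), dim=-1)   # 100 images
    txt_emb = F.normalize(torch.randn(500, 256), dim=-1)   # 500 non-parallel texts

    sim = img_emb @ txt_emb.T                  # (100, 500) cosine similarities
    best_text = sim.argmax(dim=1)              # closest text for each image
    weak_pairs = [(i, best_text[i].item()) for i in range(len(img_emb))]
    # weak_pairs now serves as a pseudo-parallel corpus for the
    # multi-granular alignment pre-training tasks.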
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models tend to attend to text rather than to images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps (a generic meta-train step is sketched after this entry).
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
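For readers unfamiliar with model-agnostic meta-learning, here is a generic sketch of one first-order meta-train step: an inner gradient update on a support set, followed by an outer update evaluated on a query set. The data, loss, and model are toy placeholders, not the paper's algorithm.

    # Hedged sketch of one first-order MAML-style meta-train step.
    # Tasks, data, and the model are toy stand-ins.
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(8, 1)
    outer_opt = torch.optim.SGD(model.parameters(), lr=0.01)
    inner_lr = 0.1

    x_support, y_support = torch.randn(16, 8), torch.randn(16, 1)
    x_query, y_query = torch.randn(16, 8), torch.randn(16, 1)

    # Inner step: adapt a copy of the model on the support set.
    fast = copy.deepcopy(model)
    loss_s = F.mse_loss(fast(x_support), y_support)
    grads = torch.autograd.grad(loss_s, list(fast.parameters()))
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g

    # Outer step (first-order): evaluate the adapted copy on the query set
    # and apply its gradients back to the original parameters.
    loss_q = F.mse_loss(fast(x_query), y_query)
    loss_q.backward()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = fp.grad.clone()
    outer_opt.step()

Pre-training on many tasks can be read as repeating such meta-train steps across tasks, which is the connection the paper formalizes.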