VLP: A Survey on Vision-Language Pre-training
- URL: http://arxiv.org/abs/2202.09061v2
- Date: Mon, 21 Feb 2022 02:58:34 GMT
- Title: VLP: A Survey on Vision-Language Pre-training
- Authors: Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang
Xu, Bo Xu
- Abstract summary: The emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era.
This paper surveys recent advances and new frontiers in vision-language pre-training, including image-text and video-text pre-training.
- Score: 24.093731037295502
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the past few years, the emergence of pre-training models has brought
uni-modal fields such as computer vision (CV) and natural language processing
(NLP) to a new era. Substantial works have shown they are beneficial for
downstream uni-modal tasks and avoid training a new model from scratch. So can
such pre-trained models be applied to multi-modal tasks? Researchers have
explored this problem and made significant progress. This paper surveys recent
advances and new frontiers in vision-language pre-training (VLP), including
image-text and video-text pre-training. To give readers a better overall grasp
of VLP, we first review its recent advances from five aspects: feature
extraction, model architecture, pre-training objectives, pre-training datasets,
and downstream tasks. Then, we summarize the specific VLP models in detail.
Finally, we discuss the new frontiers in VLP. To the best of our knowledge,
this is the first survey on VLP. We hope that this survey can shed light on
future research in the VLP field.
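To make the reviewed pre-training objectives more concrete, below is a minimal sketch of an image-text contrastive (ITC) loss, one objective commonly used by VLP models and covered under the survey's "pre-training objectives" aspect. The encoder outputs, embedding dimension, and temperature value are illustrative assumptions and are not taken from this paper.

```python
# Minimal sketch of an image-text contrastive (ITC) pre-training objective.
# All dimensions and the temperature are placeholder assumptions.
import torch
import torch.nn.functional as F

def itc_loss(image_features: torch.Tensor,
             text_features: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Example usage with random placeholder embeddings (batch of 8, dim 256).
    img = torch.randn(8, 256)
    txt = torch.randn(8, 256)
    print(itc_loss(img, txt))
```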
Related papers
- Large Language Models Meet NLP: A Survey [79.74450825763851]
Large language models (LLMs) have shown impressive capabilities in Natural Language Processing (NLP) tasks.
This study aims to address this gap by exploring the following questions.
arXiv Detail & Related papers (2024-05-21T14:24:01Z)
- Medical Vision Language Pretraining: A survey [8.393439175704124]
Medical Vision Language Pretraining is a promising solution to the scarcity of labeled data in the medical domain.
By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations.
arXiv Detail & Related papers (2023-12-11T09:14:13Z)
- Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods, with much faster inference, since PTP discards the object detector at inference time while the latter cannot.
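To illustrate the fill-in-the-blank reformulation described above, here is a minimal sketch of how position-guided prompts could be constructed. The 3x3 grid, the prompt template, and the helper names (block_index, make_prompts) are assumptions for illustration and are not taken verbatim from the PTP paper.

```python
# Illustrative sketch of the position-guided fill-in-the-blank idea.
# Grid size, template wording, and inputs are assumed for illustration.
from typing import List, Tuple

def block_index(cx: float, cy: float, grid: int = 3) -> int:
    """Map a normalized object center (cx, cy) in [0, 1] to a grid block id."""
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    return row * grid + col

def make_prompts(objects: List[Tuple[str, float, float]],
                 mask_position: bool = False) -> List[str]:
    """Build fill-in-the-blank prompts: mask either the object or its block."""
    prompts = []
    for name, cx, cy in objects:
        block = block_index(cx, cy)
        if mask_position:
            prompts.append(f"The block [MASK] has a {name}")   # regress the block
        else:
            prompts.append(f"The block {block} has a [MASK]")  # predict the object
    return prompts

if __name__ == "__main__":
    # Example: two hypothetical detected objects with normalized box centers.
    objs = [("dog", 0.15, 0.80), ("frisbee", 0.55, 0.30)]
    print(make_prompts(objs))                      # object prediction prompts
    print(make_prompts(objs, mask_position=True))  # position regression prompts
```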
arXiv Detail & Related papers (2022-12-19T18:55:43Z)
- VindLU: A Recipe for Effective Video-and-Language Pretraining [83.49216853881595]
This paper conducts an empirical study demystifying the most important factors in the VidL model design.
Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining.
Our model, trained using this recipe, achieves results comparable to or better than the state of the art on several VidL tasks.
arXiv Detail & Related papers (2022-12-09T18:54:05Z)
- Probing Cross-modal Semantics Alignment Capability from the Textual Perspective [52.52870614418373]
Aligning cross-modal semantics is claimed to be one of the essential capabilities of vision and language pre-training models.
We propose a new probing method based on image captioning to first empirically study the cross-modal semantics alignment of vision and language pre-training models.
arXiv Detail & Related papers (2022-10-18T02:55:58Z)
- Vision-and-Language Pretraining [19.903012955284698]
This article provides a comprehensive review of contemporary V&L pretraining models.
In particular, we categorize and delineate pretraining approaches, along with the summary of state-of-the-art vision-and-language pretrained models.
arXiv Detail & Related papers (2022-07-05T02:18:49Z)
- VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations [28.322824790738768]
Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by CheckList for testing natural language processing models, we introduce VL-CheckList, a novel framework.
arXiv Detail & Related papers (2022-07-01T06:25:53Z)
- A Survey of Vision-Language Pre-Trained Models [41.323956143107644]
Pre-trained models have advanced at a breakneck pace in recent years.
How to adapt pre-training to the field of Vision-and-Language learning and improve performance on downstream tasks has become a focus of multimodal learning.
arXiv Detail & Related papers (2022-02-18T15:15:46Z)
- Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey [67.82942975834924]
Large, pre-trained language models such as BERT have drastically changed the Natural Language Processing (NLP) field.
We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches.
arXiv Detail & Related papers (2021-11-01T20:08:05Z)
- Pre-trained Models for Natural Language Processing: A Survey [75.95500552357429]
The emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era.
This survey is intended to serve as a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
arXiv Detail & Related papers (2020-03-18T15:22:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences arising from its use.