Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
- URL: http://arxiv.org/abs/2210.09263v1
- Date: Mon, 17 Oct 2022 17:11:36 GMT
- Title: Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
- Authors: Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao
- Abstract summary: This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence developed over the last few years.
For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced.
In addition, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; (ii) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and (iii) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.
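To make the first category concrete, below is a minimal sketch of a CLIP-style image-text contrastive pre-training objective, one family of VLP methods for image-text tasks that the survey covers. This is an illustration only, not the survey's method: the ToyVLP class, the linear stand-in encoders, the precomputed 2048-d image and 768-d text features, the 256-d embedding size, and the temperature initialization are all assumptions made for the sketch.

```python
# Minimal sketch of a CLIP-style image-text contrastive objective.
# Encoders, feature dimensions, and temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLP(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Linear stand-ins for real encoders (e.g., a ViT and a text Transformer),
        # assuming precomputed 2048-d image and 768-d text features.
        self.image_proj = nn.Linear(2048, embed_dim)
        self.text_proj = nn.Linear(768, embed_dim)
        # Learnable temperature, initialized near log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        # Project both modalities into a shared space and L2-normalize.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Pairwise cosine similarities, scaled by the temperature.
        logits = self.logit_scale.exp() * img @ txt.t()
        # Matched (image, text) pairs lie on the diagonal of the logit matrix.
        targets = torch.arange(logits.size(0))
        # Symmetric InfoNCE loss over image->text and text->image directions.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

model = ToyVLP()
loss = model(torch.randn(8, 2048), torch.randn(8, 768))
loss.backward()
```

The diagonal-target construction is what makes this contrastive: each image is pushed toward its own caption and away from every other caption in the batch, and vice versa.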
Related papers
- A Review of Deep Learning for Video Captioning (2023-04-22): Video captioning (VC) is a fast-moving, cross-disciplinary area of research. This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, and dense video captioning (DVC).
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization (2022-05-24): We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs. Our findings call for a backbone modeling approach that can be built on to advance text generation from images and text beyond image captioning.
- Vision-Language Intelligence: Tasks, Representation Learning, and Large Models (2022-03-03): This paper presents a comprehensive survey of vision-language intelligence from a historical perspective. We summarize the development of the field into three periods: task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
- A Survey of Vision-Language Pre-Trained Models (2022-02-18): Pre-trained models have advanced at a breakneck pace in recent years. How to adapt pre-training to vision-and-language learning and improve performance on downstream tasks has become a focus of multimodal learning.
- From Show to Tell: A Survey on Image Captioning (2021-07-14): Connecting Vision and Language plays an essential role in Generative Intelligence. Research in image captioning has not yet converged on a definitive solution. This work provides a comprehensive overview and categorization of image captioning approaches.
- Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review (2021-03-27): This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
- Look Before you Speak: Visually Contextualized Utterances (2020-12-10): We create a task for predicting utterances in a video using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
- Image Segmentation Using Deep Learning: A Survey (2020-01-15): Image segmentation is a key topic in image processing and computer vision. There has been a substantial amount of work aimed at developing image segmentation approaches using deep learning models.