Vision-Language Intelligence: Tasks, Representation Learning, and Large
Models
- URL: http://arxiv.org/abs/2203.01922v1
- Date: Thu, 3 Mar 2022 18:54:59 GMT
- Title: Vision-Language Intelligence: Tasks, Representation Learning, and Large
Models
- Authors: Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni,
PengChuan Zhang, Lei Zhang
- Abstract summary: This paper presents a comprehensive survey of vision-language intelligence from the perspective of time.
We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
- Score: 32.142076223602906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a comprehensive survey of vision-language (VL)
intelligence from the perspective of time. This survey is inspired by the
remarkable progress in both computer vision and natural language processing,
and recent trends shifting from single modality processing to multiple modality
comprehension. We summarize the development in this field into three time
periods, namely task-specific methods, vision-language pre-training (VLP)
methods, and larger models empowered by large-scale weakly-labeled data. We
first take some common VL tasks as examples to introduce the development of
task-specific methods. Then we focus on VLP methods and comprehensively review
key components of the model structures and training methods. After that, we
show how recent work utilizes large-scale raw image-text data to learn
language-aligned visual representations that generalize better on zero- or
few-shot learning tasks. Finally, we discuss some potential future trends towards
modality cooperation, unified representation, and knowledge incorporation. We
believe that this review will be helpful to researchers and practitioners of
AI and ML, especially those interested in computer vision and natural language
processing.
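The third period described in the abstract, learning language-aligned visual representations from large-scale raw image-text data, is typified by contrastive pre-training of the kind popularized by CLIP and ALIGN. Below is a minimal sketch of a symmetric contrastive (InfoNCE) objective over a batch of paired image and text embeddings; the temperature value and the zero-shot recipe in the comments are illustrative assumptions, not details drawn from the survey.

```python
# Minimal sketch of CLIP-style contrastive image-text pre-training,
# the mechanism behind "language-aligned visual representations".
# Encoders are assumed to already map images/texts to d-dim vectors;
# the temperature and batch handling are illustrative choices.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of (image, text) pairs."""
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between image i and text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i is text i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast each image against all texts and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Zero-shot classification then reduces to retrieval: embed prompts such
# as "a photo of a {label}" with the text encoder and pick the label whose
# embedding is closest to the image embedding.
```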
Related papers
- VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark suite of visuo-linguistic reasoning tasks.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model learns from self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
The dataset and benchmark cover a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
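The summary above says DiMBERT applies separated attention spaces for vision and language. As a rough illustration of that idea (not DiMBERT's actual architecture), the hypothetical module below gives each modality its own multi-head attention parameters and sums the two context vectors; the module name, fusion rule, and dimensions are assumptions.

```python
# Hypothetical sketch of per-modality ("disentangled") attention spaces:
# visual and textual context are attended to with separate parameters
# instead of one shared attention over the concatenated sequence.
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Separate attention parameters for each modality (assumption).
        self.vision_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.language_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, vision_tokens, language_tokens):
        # Attend to image regions and to words in their own attention spaces.
        vis_ctx, _ = self.vision_attn(queries, vision_tokens, vision_tokens)
        lang_ctx, _ = self.language_attn(queries, language_tokens, language_tokens)
        # Simple additive fusion of the two context streams (assumption).
        return vis_ctx + lang_ctx
```

A shared attention space would instead concatenate vision_tokens and language_tokens and run one attention module over them; keeping the spaces separate lets the model weigh visual and textual context independently.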
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Vision-and-Language Pretraining [19.903012955284698]
This article provides a comprehensive review of contemporary V&L pretraining models.
In particular, we categorize and delineate pretraining approaches, along with the summary of state-of-the-art vision-and-language pretrained models.
arXiv Detail & Related papers (2022-07-05T02:18:49Z)
- Vision-and-Language Pretrained Models: A Survey [3.270244666687303]
We present an overview of the major advances achieved in vision-and-language pretrained models (VLPMs).
We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content.
arXiv Detail & Related papers (2022-04-15T07:33:06Z)
- A Survey of Vision-Language Pre-Trained Models [41.323956143107644]
Pre-trained models have advanced at a breakneck pace in recent years.
Adapting pre-training to vision-and-language learning and improving performance on downstream tasks has become a focus of multimodal learning.
arXiv Detail & Related papers (2022-02-18T15:15:46Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending to text rather than images during inference (see the sketch after this entry).
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
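The key observation above, that pre-trained models attend to text more than to images, can be quantified by summing attention probabilities over each modality's positions. The sketch below shows one such measurement for a single layer; the token layout (text positions first, image regions after) and the averaging over heads and queries are assumptions, not the exact VALUE probing protocol.

```python
# Hedged sketch: fraction of attention mass placed on text vs. image
# tokens, in the spirit of probing analyses such as VALUE.
import torch

def modality_attention_mass(attn: torch.Tensor, num_text_tokens: int):
    """attn: (num_heads, seq_len, seq_len) attention probabilities for one
    layer, with text tokens occupying the first `num_text_tokens` positions
    and image regions the rest (layout assumed for illustration)."""
    text_mass = attn[..., :num_text_tokens].sum(dim=-1).mean().item()
    image_mass = attn[..., num_text_tokens:].sum(dim=-1).mean().item()
    return text_mass, image_mass

# Example with random attention: each row sums to 1, so the two shares
# add up to roughly 1.0.
attn = torch.softmax(torch.randn(12, 40, 40), dim=-1)
print(modality_attention_mass(attn, num_text_tokens=20))
```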
This list is automatically generated from the titles and abstracts of the papers listed on this site.