Multimodal Foundation Models: From Specialists to General-Purpose
Assistants
- URL: http://arxiv.org/abs/2309.10020v1
- Date: Mon, 18 Sep 2023 17:56:28 GMT
- Title: Multimodal Foundation Models: From Specialists to General-Purpose
Assistants
- Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan
Wang, Jianfeng Gao
- Abstract summary: The research landscape encompasses five core topics, categorized into two classes.
The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities.
- Score: 187.72038587829223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a comprehensive survey of the taxonomy and evolution of
multimodal foundation models that demonstrate vision and vision-language
capabilities, focusing on the transition from specialist models to
general-purpose assistants. The research landscape encompasses five core
topics, categorized into two classes. (i) We start with a survey of
well-established research areas: multimodal foundation models pre-trained for
specific purposes, including two topics -- methods of learning vision backbones
for visual understanding and text-to-image generation. (ii) Then, we present
recent advances in exploratory, open research areas: multimodal foundation
models that aim to play the role of general-purpose assistants, including three
topics -- unified vision models inspired by large language models (LLMs),
end-to-end training of multimodal LLMs, and chaining multimodal tools with
LLMs. The target audiences of the paper are researchers, graduate students, and
professionals in computer vision and vision-language multimodal communities who
are eager to learn the basics and recent advances in multimodal foundation
models.
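As a concrete illustration of the third exploratory topic above, chaining multimodal tools with LLMs, here is a minimal, hypothetical Python sketch of the pattern: an LLM acts as a controller that routes a user request to external vision tools and composes a final answer from the tool's output. All function names, the stub tools, and the keyword-based router standing in for the LLM's tool choice are illustrative assumptions, not details from the paper.

    # Hypothetical sketch of LLM tool chaining; all names are illustrative.
    from typing import Callable, Dict

    def caption_image(path: str) -> str:
        # Stand-in for an image-captioning tool.
        return f"a caption for {path}"

    def generate_image(prompt: str) -> str:
        # Stand-in for a text-to-image generation tool.
        return f"<image generated from: {prompt}>"

    TOOLS: Dict[str, Callable[[str], str]] = {
        "caption": caption_image,
        "generate": generate_image,
    }

    def choose_tool(request: str) -> str:
        # In a real system the LLM itself emits the tool name;
        # a keyword match stands in for that decision here.
        return "generate" if "draw" in request.lower() else "caption"

    def run(request: str, argument: str) -> str:
        observation = TOOLS[choose_tool(request)](argument)
        # The observation would be fed back to the LLM for the final answer.
        return f"LLM answer based on: {observation}"

    print(run("Please draw a cat on a skateboard", "a cat on a skateboard"))

Running the sketch routes the request to the generation stub and wraps its output in a final response, the same observe-then-respond loop that systems in this topic area build around a real LLM.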
Related papers
- Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond [51.141270065306514]
This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI.
We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language.
Hands-on laboratories will offer practical experience with state-of-the-art multimodal models.
arXiv Detail & Related papers (2024-10-08T01:41:56Z)
- Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities [5.22475289121031]
Multimodal models are expected to be a critical component of future advances in artificial intelligence.
This work provides a fresh perspective on generalist multimodal models via a novel taxonomy based on architectures and training configurations.
arXiv Detail & Related papers (2024-06-08T15:30:46Z)
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey covers multimodal generation and editing across various domains, including image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We also examine tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- The Revolution of Multimodal Large Language Models: A Survey [46.84953515670248]
Multimodal Large Language Models (MLLMs) can seamlessly integrate visual and textual modalities.
This paper provides a review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques.
arXiv Detail & Related papers (2024-02-19T19:01:01Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of multi-modal visual understanding foundation models (MM-VUFMs) specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is the task of learning to embed information from different modalities together with their correlations (a minimal contrastive-alignment sketch appears after this list).
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions [68.6358773622615]
This paper provides an overview of the computational and theoretical foundations of multimodal machine learning.
We propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification.
Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches.
arXiv Detail & Related papers (2022-09-07T19:21:19Z)
- New Ideas and Trends in Deep Multimodal Content Understanding: A Review [24.576001583494445]
This survey focuses on two modalities of multimodal deep learning: image and text.
It examines recent multimodal deep models and structures, including autoencoders, generative adversarial networks, and their variants.
arXiv Detail & Related papers (2020-10-16T06:50:54Z)
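The contrastive-alignment sketch referenced in the Multimodality Representation Learning entry above: a minimal, self-contained Python example of embedding two modalities into a shared space and scoring their correlations with a symmetric InfoNCE-style loss, in the spirit of CLIP-like methods. The toy features, dimensions, and temperature are illustrative assumptions, not values from any surveyed paper.

    # Minimal sketch of contrastive multimodal representation learning;
    # toy data and hyperparameters are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    def l2_normalize(x: np.ndarray) -> np.ndarray:
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Toy "encoder outputs": a batch of 4 paired image/text features
    # projected into a shared 8-dimensional embedding space.
    image_emb = l2_normalize(rng.normal(size=(4, 8)))
    text_emb = l2_normalize(rng.normal(size=(4, 8)))

    # Pairwise cosine similarities; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.T / 0.07  # 0.07: a typical temperature

    def cross_entropy(scores: np.ndarray) -> float:
        # Log-softmax over each row, then score the matched (diagonal) pair.
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Symmetric loss: each image should match its text and vice versa.
    loss = 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
    print(f"contrastive alignment loss: {loss:.3f}")

Minimizing this loss pulls matched image/text embeddings together and pushes mismatched ones apart, which is the cross-modal interaction the entry above describes.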
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.