Related papers: Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond

Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond

URL: http://arxiv.org/abs/2409.14993v1
Date: Mon, 23 Sep 2024 13:16:09 GMT
Title: Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Authors: Hong Chen, Xin Wang, Yuwei Zhou, Bin Huang, Yipeng Zhang, Wei Feng, Houlun Chen, Zeyang Zhang, Siao Tang, Wenwu Zhu,
Abstract summary: Multi-modal generative AI has received increasing attention in both academia and industry. One natural question arises: Is it possible to have a unified model for both understanding and generation?
Score: 48.43910061720815
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal generative AI has received increasing attention in both academia and industry. Particularly, two dominant families of techniques are: i) The multi-modal large language model (MLLM) such as GPT-4V, which shows impressive ability for multi-modal understanding; ii) The diffusion model such as Sora, which exhibits remarkable multi-modal powers, especially with respect to visual generation. As such, one natural question arises: Is it possible to have a unified model for both understanding and generation? To answer this question, in this paper, we first provide a detailed review of both MLLM and diffusion models, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video large language models as well as text-to-image/video generation. Then, we discuss the two important questions on the unified model: i) whether the unified model should adopt the auto-regressive or diffusion probabilistic modeling, and ii) whether the model should utilize a dense architecture or the Mixture of Experts(MoE) architectures to better support generation and understanding, two objectives. We further provide several possible strategies for building a unified model and analyze their potential advantages and disadvantages. We also summarize existing large-scale multi-modal datasets for better model pretraining in the future. To conclude the paper, we present several challenging future directions, which we believe can contribute to the ongoing advancement of multi-modal generative AI.

Related papers

Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I)<n>We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal kontext'' composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs.<n> Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
A Survey of Generative Categories and Techniques in Multimodal Large Language Models [3.7507324448128876]
Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation.<n>This survey categorises six primary generative modalities and examines how foundational techniques enable cross-modal capabilities.
arXiv Detail & Related papers (2025-05-29T12:29:39Z)
MMaDA: Multimodal Large Diffusion Language Models [61.13527224215318]
We introduce MMaDA, a novel class of multimodal diffusion foundation models.<n>It is designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation.
arXiv Detail & Related papers (2025-05-21T17:59:05Z)
A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research. multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z)
MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach.<n>MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student.<n>We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
arXiv Detail & Related papers (2025-02-03T08:50:00Z)
Dual Diffusion for Unified Image Generation and Understanding [32.7554623473768]
We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation. We leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly. Our model attained competitive performance compared to recent unified image understanding and generation models.
arXiv Detail & Related papers (2024-12-31T05:49:00Z)
Learning Multimodal Latent Generative Models with Energy-Based Prior [3.6648642834198797]
We propose a novel framework that integrates the latent generative model with the EBM. This approach results in a more expressive and informative prior, better-capturing of information across multiple modalities.
arXiv Detail & Related papers (2024-09-30T01:38:26Z)
From Efficient Multimodal Models to World Models: A Survey [28.780451336834876]
Multimodal Large Models (MLMs) are becoming a significant research focus combining powerful language models with multimodal learning. This review explores the latest developments and challenges in large instructions, emphasizing their potential in achieving artificial general intelligence.
arXiv Detail & Related papers (2024-06-27T15:36:43Z)
Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities [5.22475289121031]
Multimodal models are expected to be a critical component to future advances in artificial intelligence. This work provides a fresh perspective on generalist multimodal models via a novel architecture and training configuration specific taxonomy.
arXiv Detail & Related papers (2024-06-08T15:30:46Z)
LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X. SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques. We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
Explaining latent representations of generative models with large multimodal models [5.9908087713968925]
Learning interpretable representations of data generative latent factors is an important topic for the development of artificial intelligence. We propose a framework to comprehensively explain each latent variable in the generative models using a large multimodal model.
arXiv Detail & Related papers (2024-02-02T19:28:33Z)
Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
Multimodal Large Language Models: A Survey [36.06016060015404]
Multimodal language models integrate multiple data types, such as images, text, language, audio, and other heterogeneity. This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms. A practical guide is provided, offering insights into the technical aspects of multimodal models. Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development.
arXiv Detail & Related papers (2023-11-22T05:15:12Z)
Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages. We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models. Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.