Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
- URL: http://arxiv.org/abs/2505.02567v4
- Date: Fri, 27 Jun 2025 13:30:10 GMT
- Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
- Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
- Abstract summary: We present a comprehensive survey aimed at guiding future research. We review existing unified models, categorizing them into three main architectural paradigms. We discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data.
- Score: 22.476740954286836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
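To make the hybrid paradigm described above concrete, the sketch below pairs an autoregressive transformer (the understanding branch, trained with next-token prediction) with a small diffusion-style noise-prediction head conditioned on the transformer's hidden states (the generation branch). This is a minimal illustration under invented assumptions: the module names, sizes, and toy noising schedule are placeholders for exposition and are not taken from any model surveyed in the paper.

```python
# Hypothetical sketch of a hybrid unified model: an autoregressive backbone
# for multimodal understanding plus a diffusion-style noise-prediction head
# for image generation. All names and sizes are illustrative only.
import torch
import torch.nn as nn


class HybridUnifiedModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, latent_dim=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)   # causal mask applied at call time
        self.lm_head = nn.Linear(d_model, vocab_size)            # autoregressive text branch
        # Diffusion branch: predicts the noise added to an image latent,
        # conditioned on the backbone's final hidden state and the timestep.
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim + d_model + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, tokens, noisy_latent=None, t=None):
        x = self.token_emb(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(x, mask=causal)
        text_logits = self.lm_head(h)                 # understanding: next-token prediction
        eps_pred = None
        if noisy_latent is not None:
            cond = h[:, -1, :]                        # condition generation on the last hidden state
            inp = torch.cat([noisy_latent, cond, t.unsqueeze(-1)], dim=-1)
            eps_pred = self.denoiser(inp)             # generation: predict the injected noise
        return text_logits, eps_pred


# Toy usage: one forward pass combining both objectives.
model = HybridUnifiedModel()
tokens = torch.randint(0, 32000, (2, 16))
latent = torch.randn(2, 64)
noise = torch.randn_like(latent)
t = torch.rand(2)
noisy = latent + t.unsqueeze(-1) * noise              # crude noising schedule, for illustration only
logits, eps_pred = model(tokens, noisy, t)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 32000), tokens[:, 1:].reshape(-1)
) + nn.functional.mse_loss(eps_pred, noise)
print(loss.item())
```

In a real unified system the diffusion head would operate on VAE image latents with a proper noise schedule, and the tokenization strategy and cross-modal attention design (the challenges the survey highlights) would dominate the engineering effort; the sketch only shows how the two training objectives can share one backbone.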
Related papers
- A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction [21.966560704390716]
We review current research on Generative Model Unlearning (GenMU). We propose a unified analytical framework for categorizing unlearning objectives, methodological strategies, and evaluation metrics. We highlight the potential practical value of unlearning techniques in real-world applications.
arXiv Detail & Related papers (2025-07-26T09:49:57Z) - Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers [90.4459196223986]
A similar evolution is now unfolding in AI, marking a paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace.
arXiv Detail & Related papers (2025-06-30T14:48:35Z) - Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest. This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance (a minimal sketch of the guidance idea follows this list). Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - From Task-Specific Models to Unified Systems: A Review of Model Merging Approaches [13.778158813149833]
Despite the rapid progress in this field, a comprehensive taxonomy and survey summarizing recent advances and predicting future directions are still lacking. This paper establishes a new taxonomy of model merging methods, systematically comparing different approaches and providing an overview of key developments.
arXiv Detail & Related papers (2025-03-12T02:17:31Z) - Personalized Image Generation with Deep Generative Models: A Decade Survey [51.26287478042516]
We present a review of generalized personalized image generation across various generative models. We first define a unified framework that standardizes the personalization process across different generative models. We then provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations.
arXiv Detail & Related papers (2025-02-18T17:34:04Z) - Explainability for Vision Foundation Models: A Survey [3.570403495760109]
Foundation models occupy an ambiguous position in the explainability domain; they are characterized by extensive generalization capabilities and emergent uses. We discuss the challenges faced by current research in integrating XAI within foundation models.
arXiv Detail & Related papers (2025-01-21T15:18:55Z) - Autoregressive Models in Vision: A Survey [119.23742136065307]
This survey comprehensively examines the literature on autoregressive models applied to vision.
We divide visual autoregressive models into three general sub-categories: pixel-based, token-based, and scale-based models.
We present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation.
arXiv Detail & Related papers (2024-11-08T17:15:12Z) - Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities [5.22475289121031]
Multimodal models are expected to be a critical component to future advances in artificial intelligence.
This work provides a fresh perspective on generalist multimodal models via a novel taxonomy specific to architectures and training configurations.
arXiv Detail & Related papers (2024-06-08T15:30:46Z) - A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence. It covers over 50 representative models and their variants, more than 20 categories of tasks, and over 680 related works. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z) - On the Challenges and Opportunities in Generative AI [157.96723998647363]
We argue that current large-scale generative AI models exhibit several fundamental shortcomings that hinder their widespread adoption across domains. We aim to provide researchers with insights for exploring fruitful research directions, thus fostering the development of more robust and accessible generative AI solutions.
arXiv Detail & Related papers (2024-02-28T15:19:33Z) - Generative AI in Vision: A Survey on Models, Metrics and Applications [0.0]
Generative AI models have revolutionized various fields by enabling the creation of realistic and diverse data samples.
Among these models, diffusion models have emerged as a powerful approach for generating high-quality images, text, and audio.
This survey paper provides a comprehensive overview of generative AI diffusion and legacy models, focusing on their underlying techniques, applications across different domains, and their challenges.
arXiv Detail & Related papers (2024-02-26T07:47:12Z) - Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey [30.528346074194925]
Visual foundation models (VFMs) have become a catalyst for groundbreaking developments in computer vision.
This review paper delineates the pivotal trajectories of VFMs, emphasizing their scalability and proficiency in generative tasks.
A crucial direction for forthcoming innovation is the amalgamation of generative and discriminative paradigms.
arXiv Detail & Related papers (2023-12-15T19:17:15Z) - Graph Foundation Models: Concepts, Opportunities and Challenges [66.37994863159861]
Foundation models have emerged as critical components in a variety of artificial intelligence applications. The capabilities of foundation models in generalization and adaptation motivate graph machine learning researchers to discuss the potential of developing a new graph learning paradigm. This article introduces the concept of Graph Foundation Models (GFMs), and offers an exhaustive explanation of their key characteristics and underlying technologies.
arXiv Detail & Related papers (2023-10-18T09:31:21Z)
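The multi-modal classifier-free guidance mentioned in the Mogao entry above can be pictured with the standard guidance formula extended to two conditioning signals (text and a reference image). The sketch below is a generic, hypothetical illustration, not Mogao's actual implementation: the `denoise` callable, its argument layout, and the guidance scales are placeholders.

```python
# Illustrative multi-modal classifier-free guidance (CFG). Generic sketch only:
# `denoise` stands in for any noise-prediction network; the guidance scales
# are arbitrary example values, not values from any surveyed model.
import torch


def multimodal_cfg(denoise, x_t, t, text_cond, image_cond, s_text=5.0, s_image=1.5):
    """Combine unconditional, text-conditioned, and fully-conditioned noise
    predictions into a single guided prediction."""
    eps_uncond = denoise(x_t, t, None, None)
    eps_text = denoise(x_t, t, text_cond, None)
    eps_full = denoise(x_t, t, text_cond, image_cond)
    # Standard CFG pushes the prediction away from the unconditional branch;
    # a second term adds extra guidance from the image condition.
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_image * (eps_full - eps_text))


# Toy usage with a dummy "network" that ignores its conditions.
dummy = lambda x, t, c_txt, c_img: torch.zeros_like(x)
x_t = torch.randn(1, 4, 8, 8)          # a noisy latent
guided = multimodal_cfg(dummy, x_t, torch.tensor([10]), text_cond=None, image_cond=None)
print(guided.shape)
```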
This list is automatically generated from the titles and abstracts of the papers on this site.