Multimodal Image Synthesis and Editing: The Generative AI Era
- URL: http://arxiv.org/abs/2112.13592v6
- Date: Thu, 24 Aug 2023 16:17:21 GMT
- Title: Multimodal Image Synthesis and Editing: The Generative AI Era
- Authors: Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu,
Lingjie Liu, Adam Kortylewski, Christian Theobalt, Eric Xing
- Abstract summary: Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
- Score: 131.9569600472503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As information exists in various modalities in the real world, effective
interaction and fusion of multimodal information play a key role in the creation
and perception of multimodal data in computer vision and deep learning research.
With its strong capability for modeling interactions among multimodal information,
multimodal image synthesis and editing has become a hot research topic in recent
years. Rather than providing explicit guidance for network training, multimodal
guidance offers an intuitive and flexible means of image synthesis and editing.
At the same time, this field faces several challenges, such as the alignment of
multimodal features, the synthesis of high-resolution images, and faithful
evaluation metrics. In this survey, we comprehensively contextualize recent
advances in multimodal image synthesis and editing and formulate taxonomies
according to data modalities and model types. We start with an introduction to
the different guidance modalities in image synthesis and editing, then describe
multimodal image synthesis and editing approaches extensively according to their
model types. After that, we describe benchmark datasets and evaluation metrics
along with corresponding experimental results. Finally, we provide insights into
current research challenges and possible directions for future research. A
project associated with this survey is available at
https://github.com/fnzhan/Generative-AI.
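To make the evaluation-metric discussion concrete, below is a minimal sketch of one widely used text-image alignment measure, CLIP score, which gauges how faithfully a synthesized image matches its guidance text. It assumes the Hugging Face transformers package and PyTorch; the checkpoint and file names are illustrative and not taken from the survey.

```python
# Minimal sketch: CLIP-based text-image alignment score, a common
# evaluation signal for multimodal image synthesis. Assumes the
# Hugging Face `transformers` package; the checkpoint and file names
# are illustrative, not from the survey.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()

# Example: score a generated sample against its guidance prompt.
# score = clip_score(Image.open("sample.png"), "a photo of a red car")
```

Higher scores indicate closer text-image agreement; in practice such a measure is typically averaged over many generated samples and reported alongside image-quality metrics.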
Related papers
- Multimodal Alignment and Fusion: A Survey [7.250878248686215]
Multimodal integration enables improved model accuracy and broader applicability.
We systematically categorize and analyze existing alignment and fusion techniques.
This survey focuses on applications in domains like social media analysis, medical imaging, and emotion recognition.
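As a toy illustration of the kind of fusion technique such surveys categorize (an assumption-laden sketch, not a method from the paper), the PyTorch module below projects image and text features into a shared space and blends them with a learned gate; all dimensions and names are invented for illustration.

```python
# Toy illustration of late multimodal fusion: project image and text
# features into a shared space, then blend with a learned gate. All
# dimensions and names are assumptions, not from the surveyed papers.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, fused_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, fused_dim)    # align image features
        self.txt_proj = nn.Linear(txt_dim, fused_dim)    # align text features
        self.gate = nn.Linear(2 * fused_dim, fused_dim)  # per-dim modality gate

    def forward(self, img_feat, txt_feat):
        v = self.img_proj(img_feat)
        t = self.txt_proj(txt_feat)
        g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))
        return g * v + (1 - g) * t  # gated convex combination of modalities

fusion = GatedFusion()
fused = fusion(torch.randn(4, 2048), torch.randn(4, 768))  # -> shape (4, 512)
```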
arXiv Detail & Related papers (2024-11-26T02:10:27Z) - A Survey of Multimodal Composite Editing and Retrieval [7.966265020507201]
This survey is the first comprehensive review of the literature on multimodal composite retrieval.
It covers image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval.
We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions.
arXiv Detail & Related papers (2024-09-09T08:06:50Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, uses dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
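A hedged sketch of this shared-weight dual-encoder idea is given below; it approximates the design described above rather than reproducing the authors' EGMS code, and every module name and dimension is hypothetical.

```python
# Hedged sketch of a dual multimodal encoder with shared weights: one
# encoder instance processes both the text-image and the entity-image
# streams, so its parameters are shared across the two. Names and
# dimensions are hypothetical, not the EGMS implementation.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, dim=768, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, query_tokens, image_tokens):
        # Concatenate the two token streams and encode them jointly.
        return self.encoder(torch.cat([query_tokens, image_tokens], dim=1))

shared = MultimodalEncoder()  # a single instance => shared weights
text_img = shared(torch.randn(2, 16, 768), torch.randn(2, 49, 768))
entity_img = shared(torch.randn(2, 4, 768), torch.randn(2, 49, 768))
```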
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - Multimodal Large Language Models: A Survey [36.06016060015404]
Multimodal language models integrate multiple data types, such as images, text, audio, and other heterogeneous data.
This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms.
A practical guide is provided, offering insights into the technical aspects of multimodal models.
Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development.
arXiv Detail & Related papers (2023-11-22T05:15:12Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image models.
Our research includes comprehensive experiments conducted on various datasets.
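A minimal sketch of such a synchronous image-dialogue synthesis loop follows, assuming the diffusers library for the text-to-image half; the ask_llm helper is a hypothetical stand-in for a ChatGPT-style API call, and its returned sample is dummy data.

```python
# Minimal sketch of synchronous image-dialogue synthesis for visual
# instruction tuning: an LLM drafts (image prompt, dialogue) pairs and
# a text-to-image model renders the image. Assumes the `diffusers`
# package and a CUDA GPU; `ask_llm` is a hypothetical stand-in for a
# ChatGPT-style API call.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def ask_llm(instruction: str) -> dict:
    # Hypothetical LLM call; a real system would query a chat API here
    # and parse its response. Dummy data for illustration only.
    return {"image_prompt": "a dog catching a frisbee in a park",
            "dialogue": [("What is the dog doing?", "Catching a frisbee.")]}

def synthesize_example(instruction: str) -> dict:
    sample = ask_llm(instruction)
    image = pipe(sample["image_prompt"]).images[0]  # render the scene
    return {"image": image, "dialogue": sample["dialogue"]}

# ex = synthesize_example("Draft a grounded VQA dialogue about an outdoor scene.")
```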
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning, which incorporates data from various sources, has become an increasingly popular research area.
We analyze the commonalities and unique characteristics of each data format, mainly covering vision, audio, text, and motion.
We investigate the existing literature on multimodal learning at both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves state-of-the-art performance on four datasets spanning multimodal link prediction, multimodal relation extraction (RE), and multimodal named entity recognition (NER).
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.