A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
- URL: http://arxiv.org/abs/2504.00837v2
- Date: Sun, 20 Apr 2025 12:55:44 GMT
- Title: A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
- Authors: Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kejun Zhang
- Abstract summary: Multi-modal music generation is an emerging research area with broad applications. This paper reviews this field, categorizing music generation systems from the perspective of modalities. Key challenges in this area include effective multi-modal integration, large-scale comprehensive datasets, and systematic evaluation methods.
- Score: 14.69952700449563
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal music generation, using multiple modalities like text, images, and video alongside musical scores and audio as guidance, is an emerging research area with broad applications. This paper reviews this field, categorizing music generation systems from the perspective of modalities. The review covers modality representation, multi-modal data alignment, and their utilization to guide music generation. Current datasets and evaluation methods are also discussed. Key challenges in this area include effective multi-modal integration, large-scale comprehensive datasets, and systematic evaluation methods. Finally, an outlook on future research directions is provided, focusing on creativity, efficiency, multi-modal alignment, and evaluation.
Related papers
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey [124.23247710880008]
Multimodal CoT (MCoT) reasoning has recently garnered significant research attention.
Existing MCoT studies design various methodologies to address the challenges of image, video, speech, audio, 3D, and structured data.
We present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions.
arXiv Detail & Related papers (2025-03-16T18:39:13Z)
- A Comprehensive Survey on Generative AI for Video-to-Music Generation [15.575851379886952]
This paper presents a comprehensive review of video-to-music generation using deep generative AI techniques.
We focus on three key components: visual feature extraction, music generation frameworks, and conditioning mechanisms.
arXiv Detail & Related papers (2025-02-18T03:18:54Z)
- Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation [2.549112678136113]
Retrieval-Augmented Generation (RAG) mitigates issues by integrating external dynamic information, enhancing factual and updated grounding.
Cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG.
This survey lays the foundation for developing more capable and reliable AI systems.
arXiv Detail & Related papers (2025-02-12T22:33:41Z)
- Multimodal Alignment and Fusion: A Survey [7.250878248686215]
Multimodal integration enables improved model accuracy and broader applicability.
We systematically categorize and analyze existing alignment and fusion techniques.
This survey focuses on applications in domains like social media analysis, medical imaging, and emotion recognition.
arXiv Detail & Related papers (2024-11-26T02:10:27Z)
- A Survey of Multimodal Composite Editing and Retrieval [7.966265020507201]
This survey is the first comprehensive review of the literature on multimodal composite retrieval.
It covers image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval.
We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions.
arXiv Detail & Related papers (2024-09-09T08:06:50Z)
- Applications and Advances of Artificial Intelligence in Music Generation:A Review [0.04551615447454769]
This paper provides a systematic review of the latest research advancements in AI music generation.
It covers key technologies, models, datasets, evaluation methods, and their practical applications across various fields.
arXiv Detail & Related papers (2024-09-03T13:50:55Z)
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- Multimodal Pretraining and Generation for Recommendation: A Tutorial [54.07497722719509]
The tutorial comprises three parts: multimodal pretraining, multimodal generation, and industrial applications.
It aims to facilitate a swift understanding of multimodal recommendation and foster meaningful discussions on the future development of this evolving landscape.
arXiv Detail & Related papers (2024-05-11T06:15:22Z)
- Multimodal Large Language Models: A Survey [36.06016060015404]
Multimodal language models integrate multiple data types, such as images, text, language, audio, and other heterogeneous data.
This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms.
A practical guide is provided, offering insights into the technical aspects of multimodal models.
Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development.
arXiv Detail & Related papers (2023-11-22T05:15:12Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, mainly vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.