Vision-to-Music Generation: A Survey
- URL: http://arxiv.org/abs/2503.21254v1
- Date: Thu, 27 Mar 2025 08:21:54 GMT
- Title: Vision-to-Music Generation: A Survey
- Authors: Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao,
- Abstract summary: Vision-to-music generation shows vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. Research in vision-to-music is still in its preliminary stage due to the complex internal structure of music and the difficulty of modeling its dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion of vision-to-music.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to the complex internal structure of music and the difficulty of modeling its dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion of vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types (general videos, human movement videos, and images) and two output types (symbolic music and audio music). We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow the latest works and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.
Related papers
- Extending Visual Dynamics for Video-to-Music Generation [51.274561293909926]
DyViM is a novel framework to enhance dynamics modeling for video-to-music generation.
High-level semantics are conveyed through a cross-attention mechanism.
Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
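The entry names cross-attention as the channel for high-level semantics but gives no implementation details. As a hedged illustration only, the minimal PyTorch sketch below shows the generic pattern: music-token queries attending over video features. All module names, shapes, and the residual layout are assumptions, not DyViM's actual code.

```python
# Minimal sketch of cross-attention from video features into music tokens.
# Illustrative only: names, dimensions, and layout are assumptions,
# not DyViM's published implementation.
import torch
import torch.nn as nn

class VideoToMusicCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Queries come from music tokens; keys/values come from video features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, music_tokens: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # music_tokens: (batch, T_music, dim); video_feats: (batch, T_video, dim)
        attended, _ = self.attn(query=music_tokens, key=video_feats, value=video_feats)
        return self.norm(music_tokens + attended)  # residual + norm

# Usage with random tensors standing in for real encoder outputs.
block = VideoToMusicCrossAttention()
music = torch.randn(2, 256, 512)  # 256 music tokens per clip (assumed)
video = torch.randn(2, 64, 512)   # 64 frame-level embeddings (assumed)
out = block(music, video)         # -> (2, 256, 512)
```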
arXiv Detail & Related papers (2025-04-10T09:47:26Z)
- A Survey: Spatiotemporal Consistency in Video Generation [72.82267240482874]
Video generation, by leveraging dynamic visual generation methods, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, while few literature reviews have been organized from this perspective. We systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics.
arXiv Detail & Related papers (2025-02-25T05:20:51Z)
- A Comprehensive Survey on Generative AI for Video-to-Music Generation [15.575851379886952]
This paper presents a comprehensive review of video-to-music generation using deep generative AI techniques. We focus on three key components: visual feature extraction, music generation frameworks, and conditioning mechanisms.
arXiv Detail & Related papers (2025-02-18T03:18:54Z)
- GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [13.9134271174972]
We present the General Video-to-Music Generation model (GVMGen) for generating music highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions. Our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios.
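As a rough illustration of attention applied hierarchically over spatial and temporal dimensions, the sketch below first self-attends over patch tokens within each frame, then over frame tokens across time. It is a generic pattern under assumed shapes, not GVMGen's published architecture.

```python
# Generic two-level (spatial, then temporal) attention over video features.
# Illustrative sketch only; GVMGen's actual layers are not reproduced here.
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, f, p, d = x.shape
        x = self.spatial(x.reshape(b * f, p, d)).reshape(b, f, p, d)
        x = x.mean(dim=2)        # pool patches -> one token per frame
        return self.temporal(x)  # attend across frames: (batch, frames, dim)

encoder = HierarchicalVideoEncoder()
feats = encoder(torch.randn(2, 16, 49, 512))  # 16 frames x 7x7 patches (assumed)
```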
arXiv Detail & Related papers (2025-01-17T06:30:11Z)
- A Survey of Foundation Models for Music Understanding [60.83532699497597]
This work is one of the early reviews of the intersection of AI techniques and music understanding.
We investigated, analyzed, and tested recent large-scale music foundation models with respect to their music comprehension abilities.
arXiv Detail & Related papers (2024-09-15T03:34:14Z)
- Prevailing Research Areas for Music AI in the Era of Foundation Models [8.067636023395236]
There has been a surge of generative music AI applications within the past few years.
We discuss the current state of music datasets and their limitations.
We highlight applications of these generative models, their extensions to multiple modalities, and their integration into artists' workflows.
arXiv Detail & Related papers (2024-09-14T09:06:43Z)
- Foundation Models for Music: A Survey [77.77088584651268]
Foundation models (FMs) have profoundly impacted diverse sectors, including music.
This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music.
arXiv Detail & Related papers (2024-08-26T15:13:14Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
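A simple baseline for conditioning a music diffusion model on both text and image cues is to project each modality into a shared space and combine them into one condition vector for the denoiser. MeLFusion's actual mechanism is more involved; the sketch below, with assumed encoder dimensions and names, only illustrates this baseline.

```python
# Baseline multimodal conditioning: project text and image embeddings into a
# shared space and sum them into one condition for a diffusion denoiser.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalCondition(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 1024, cond_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.image_proj = nn.Linear(image_dim, cond_dim)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Summing keeps a single condition vector the denoiser can attend to.
        return self.text_proj(text_emb) + self.image_proj(image_emb)

text_emb = torch.randn(2, 768)    # e.g. output of a text encoder (assumed)
image_emb = torch.randn(2, 1024)  # e.g. output of an image encoder (assumed)
cond = MultimodalCondition()(text_emb, image_emb)  # -> (2, 512)
```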
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- A Review of Intelligent Music Generation Systems [4.287960539882345]
ChatGPT has significantly reduced the barrier to entry for non-professionals in creative endeavors.
Modern generative algorithms can extract patterns implicit in a piece of music based on rule constraints or a musical corpus.
arXiv Detail & Related papers (2022-11-16T13:43:16Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
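The sketch below illustrates the general conditional-GAN setup this entry describes: a generator maps concatenated frame and motion features to a music representation, and a discriminator scores it as real or fake. D2M-GAN actually generates quantized (VQ) music tokens; this simplified continuous version, with assumed shapes and names, is a sketch only.

```python
# Simplified conditional GAN for music from dance video. D2M-GAN works on
# quantized music tokens; this continuous-feature version is illustrative.
import torch
import torch.nn as nn

class MusicGenerator(nn.Module):
    def __init__(self, cond_dim: int = 512, music_dim: int = 128, steps: int = 256):
        super().__init__()
        self.steps, self.music_dim = steps, music_dim
        self.net = nn.Sequential(nn.Linear(cond_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, music_dim * steps))

    def forward(self, frame_feats: torch.Tensor, motion_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats, motion_feats: (batch, 256) each (assumed dimensions)
        cond = torch.cat([frame_feats, motion_feats], dim=-1)
        return self.net(cond).view(-1, self.steps, self.music_dim)

class MusicDiscriminator(nn.Module):
    def __init__(self, music_dim: int = 128, steps: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(music_dim * steps, 256), nn.ReLU(),
                                 nn.Linear(256, 1))  # real/fake logit

    def forward(self, music: torch.Tensor) -> torch.Tensor:
        return self.net(music)

gen, disc = MusicGenerator(), MusicDiscriminator()
fake = gen(torch.randn(2, 256), torch.randn(2, 256))  # -> (2, 256, 128)
logit = disc(fake)                                    # -> (2, 1)
```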
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
- A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions [10.179835761549471]
This paper attempts to provide an overview of various composition tasks under different music generation levels using deep learning.
In addition, we summarize datasets suitable for diverse tasks, discuss the music representations, the evaluation methods as well as the challenges under different levels, and finally point out several future directions.
arXiv Detail & Related papers (2020-11-13T08:01:20Z)