Extending Visual Dynamics for Video-to-Music Generation
- URL: http://arxiv.org/abs/2504.07594v1
- Date: Thu, 10 Apr 2025 09:47:26 GMT
- Title: Extending Visual Dynamics for Video-to-Music Generation
- Authors: Xiaohao Liu, Teng Tu, Yunshan Ma, Tat-Seng Chua
- Abstract summary: DyViM is a novel framework to enhance dynamics modeling for video-to-music generation. High-level semantics are conveyed through a cross-attention mechanism. Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
- Score: 51.274561293909926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music profoundly enhances video production by improving quality, engagement, and emotional resonance, sparking growing interest in video-to-music generation. Despite recent advances, existing approaches remain limited to specific scenarios or undervalue visual dynamics. To address these limitations, we focus on tackling the complexity of dynamics and resolving the temporal misalignment between video and music representations. To this end, we propose DyViM, a novel framework that enhances dynamics modeling for video-to-music generation. Specifically, we extract frame-wise dynamics features via a simplified motion encoder inherited from optical flow methods, followed by a self-attention module for aggregation within frames. These dynamics features are then incorporated to extend existing music tokens for temporal alignment. Additionally, high-level semantics are conveyed through a cross-attention mechanism, and an annealing tuning strategy allows well-trained music decoders to be fine-tuned efficiently, facilitating seamless adaptation. Extensive experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
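The abstract outlines a concrete pipeline: frame-wise dynamics features from a simplified motion encoder, self-attention aggregation within each frame, extension of the music tokens with those dynamics features, and cross-attention for high-level semantics. The PyTorch sketch below is one plausible reading of that pipeline under assumed shapes and module choices; the conv-based motion encoder, the mean pooling, and the additive token extension are illustrative guesses, not the authors' implementation.

```python
# Illustrative sketch only: a plausible reading of the pipeline described in the
# abstract (frame-wise dynamics -> self-attention aggregation -> music-token
# extension -> cross-attention over semantics). All shapes/modules are assumptions.
import torch
import torch.nn as nn


class DynamicsConditionedMusicBlock(nn.Module):
    def __init__(self, d_music=512, d_dyn=128, d_sem=768, n_heads=8):
        super().__init__()
        # Stand-in for the "simplified motion encoder inherited from optical flow
        # methods": here just a conv over stacked consecutive frame pairs.
        self.motion_encoder = nn.Conv2d(6, d_dyn, kernel_size=7, stride=4, padding=3)
        # Self-attention to aggregate spatial motion tokens within each frame.
        self.frame_attn = nn.MultiheadAttention(d_dyn, n_heads, batch_first=True)
        # Project aggregated dynamics so they can extend the music tokens.
        self.dyn_to_music = nn.Linear(d_dyn, d_music)
        # Cross-attention carrying high-level semantics into the music stream.
        self.sem_cross_attn = nn.MultiheadAttention(
            d_music, n_heads, kdim=d_sem, vdim=d_sem, batch_first=True
        )

    def forward(self, frames, music_tokens, semantic_tokens):
        # frames: (B, T, 3, H, W); music_tokens: (B, T, d_music);
        # semantic_tokens: (B, S, d_sem)
        B, T, C, H, W = frames.shape
        pairs = torch.cat([frames[:, :-1], frames[:, 1:]], dim=2)   # (B, T-1, 6, H, W)
        pairs = torch.cat([pairs, pairs[:, -1:]], dim=1)            # pad back to length T
        feat = self.motion_encoder(pairs.flatten(0, 1))             # (B*T, d_dyn, h, w)
        tokens = feat.flatten(2).transpose(1, 2)                    # (B*T, h*w, d_dyn)
        agg, _ = self.frame_attn(tokens, tokens, tokens)            # within-frame attention
        dyn = agg.mean(dim=1).view(B, T, -1)                        # one vector per frame
        # "Extend" the music tokens with temporally aligned dynamics features.
        music = music_tokens + self.dyn_to_music(dyn)
        # Convey high-level semantics via cross-attention.
        sem, _ = self.sem_cross_attn(music, semantic_tokens, semantic_tokens)
        return music + sem
```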
Related papers
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos.
We advocate for the incorporation of a retrieval mechanism during the generation phase.
Our pipeline is designed to apply to any text-to-video diffusion model.
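As a rough illustration of retrieval-augmented conditioning in general (not RAGME's specific pipeline), the snippet below embeds a prompt, fetches the most similar motion clips from a pre-built bank by cosine similarity, and returns them for use as extra conditioning; the embedder, bank, and injection point are all assumed.

```python
# Minimal sketch of retrieval-augmented conditioning: fetch the motion clips whose
# embeddings are closest to the prompt embedding. Placeholders only, not RAGME's code.
import torch
import torch.nn.functional as F


def retrieve_motion_clips(prompt_emb, clip_embs, clips, k=3):
    # prompt_emb: (d,), clip_embs: (N, d), clips: list of N motion/video tensors
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), clip_embs, dim=-1)  # (N,)
    topk = sims.topk(k).indices
    return [clips[i] for i in topk.tolist()]


# Hypothetical usage: retrieved = retrieve_motion_clips(embed(prompt), bank_embs, bank),
# with the retrieved clips then injected as additional conditioning for the
# text-to-video diffusion model during sampling.
```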
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
- Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries [1.1743167854433303]
EMSYNC is a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries.
We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts.
In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for both music-theory-aware participants and general listeners.
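A minimal sketch of what such a boundary-offset signal could look like, assuming it encodes the time remaining until the next scene cut at each generation step; this is a guess at the mechanism, not EMSYNC's actual conditioning code.

```python
# Hedged illustration of a "boundary offset" conditioning signal: for every generation
# step, the time remaining until the next scene cut, so the model can anticipate
# upcoming boundaries.
def boundary_offsets(step_times, cut_times):
    """step_times: sorted generation timestamps (s); cut_times: sorted scene-cut times (s)."""
    offsets = []
    for t in step_times:
        upcoming = [c - t for c in cut_times if c >= t]
        offsets.append(min(upcoming) if upcoming else float("inf"))
    return offsets


print(boundary_offsets([0.0, 1.0, 2.0, 3.5], [2.5, 4.0]))  # [2.5, 1.5, 0.5, 0.5]
```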
arXiv Detail & Related papers (2025-02-14T13:32:59Z)
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [32.741262543860934]
We present a framework for learning to generate background music from video inputs.
We develop a generative video-music Transformer with a novel semantic video-music alignment scheme.
A new temporal video encoder architecture allows us to efficiently process videos consisting of many densely sampled frames.
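To make "semantic video-music alignment" concrete, here is a generic CLIP-style contrastive objective over paired video and music embeddings; VMAS's actual alignment scheme may differ, and the embedding dimensions are assumed.

```python
# Generic contrastive objective for aligning video and music embeddings, shown only
# as an illustration of semantic alignment; not VMAS's actual scheme.
import torch
import torch.nn.functional as F


def video_music_alignment_loss(video_emb, music_emb, temperature=0.07):
    # video_emb, music_emb: (B, d), one paired video/music clip per row
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: matched pairs sit on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```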
arXiv Detail & Related papers (2024-09-11T17:56:48Z)
- CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities [56.5742116979914]
CustomCrafter preserves the model's motion generation and concept composition abilities without requiring additional video or fine-tuning for recovery.
For motion generation, we observe that VDMs tend to recover the motion of the video in the early stage of denoising.
In the later stage of denoising, we restore this module to repair the appearance details of the specified subject.
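A toy sketch of the stage-dependent idea described above, assuming a hypothetical appearance module that is skipped during early denoising steps (which shape motion and layout) and re-enabled for later steps (which refine appearance); the names and the switch point are made up for illustration.

```python
# Toy illustration: keep a subject-appearance module out of the early denoising steps
# and re-enable it for the later steps. Hypothetical callables, not CustomCrafter's code.
def denoise(x, steps, base_step, appearance_module, switch_ratio=0.5):
    for i, t in enumerate(steps):
        use_appearance = i >= int(switch_ratio * len(steps))  # later stage only
        x = base_step(x, t, appearance_module if use_appearance else None)
    return x
```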
arXiv Detail & Related papers (2024-08-23T17:26:06Z)
- CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video [25.551944406980297]
We propose a novel approach to generate high-quality novel views from monocular videos of complex and dynamic scenes.
We introduce a module that operates in both the time and frequency domains to aggregate the features of object motion.
Our experiments demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets.
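As a rough illustration of aggregating per-frame motion features in both the time and the frequency domain, the snippet below combines a temporal mean with low-frequency FFT magnitudes; CTNeRF's cross-time transformer is considerably more involved, so treat this purely as a sketch of the idea.

```python
# Rough sketch: temporal mean (time domain) plus low-frequency FFT magnitudes
# (frequency domain) of per-frame motion features. Purely illustrative.
import torch


def time_freq_aggregate(features, n_freq=4):
    # features: (T, d) per-frame motion features
    time_part = features.mean(dim=0)                        # (d,) time-domain summary
    spectrum = torch.fft.rfft(features, dim=0)              # (T//2 + 1, d) frequency domain
    freq_part = spectrum.abs()[:n_freq].flatten()           # keep the lowest frequencies
    return torch.cat([time_part, freq_part])                # combined descriptor


print(time_freq_aggregate(torch.randn(16, 8)).shape)        # torch.Size([40])
```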
arXiv Detail & Related papers (2024-01-10T00:40:05Z)
- MAGMA: Music Aligned Generative Motion Autodecoder [15.825872274297735]
We introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE).
We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms.
Our proposed approach achieves state-of-the-art results on music-to-motion generation benchmarks and enables real-time generation of considerably longer motion sequences.
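The "naive" hand-crafted baseline can be made concrete with standard librosa calls such as MFCCs, onset strength, and beat tracking; the exact feature set used in the paper is not specified here, so this is only a representative example.

```python
# Representative librosa feature extraction (timbre, rhythmic energy, beats), standing
# in for the hand-crafted baseline mentioned above; not MAGMA's exact feature set.
import librosa

y, sr = librosa.load(librosa.example("trumpet"))            # any audio file works here
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)          # timbre features, (20, frames)
onset_env = librosa.onset.onset_strength(y=y, sr=sr)        # rhythmic energy per frame
tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
print(mfcc.shape, onset_env.shape, tempo, beats.shape)
```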
arXiv Detail & Related papers (2023-09-03T15:21:47Z)
- DynIBaR: Neural Dynamic Image-Based Rendering [79.44655794967741]
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene.
We adopt a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views.
We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets.
arXiv Detail & Related papers (2022-11-20T20:57:02Z)
- Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis [56.550999933048075]
We propose a video-based synthesis method that tackles the challenges of in-the-wild videos and demonstrates high-quality results.
We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes.
We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.
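A toy illustration of modulating convolution weights with a per-sample motion-signature vector, in the spirit of StyleGAN2-style modulation; the real method's signature computation and generator differ, so only the weight-modulation mechanism is shown.

```python
# Toy weight modulation: a motion-signature vector scales the input channels of a
# conv kernel per sample. Illustrative only; not the paper's generator.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionModulatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, sig_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.to_scale = nn.Linear(sig_dim, in_ch)  # motion signature -> per-channel scales

    def forward(self, x, motion_sig):
        # x: (B, in_ch, H, W); motion_sig: (B, sig_dim)
        scale = self.to_scale(motion_sig) + 1.0                       # (B, in_ch)
        w = self.weight.unsqueeze(0) * scale[:, None, :, None, None]  # per-sample kernels
        B, _, H, W = x.shape
        out_ch = w.shape[1]
        # Grouped-conv trick: run every sample with its own modulated kernel at once.
        out = F.conv2d(x.reshape(1, -1, H, W), w.reshape(-1, *w.shape[2:]),
                       padding=1, groups=B)
        return out.reshape(B, out_ch, H, W)
```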
arXiv Detail & Related papers (2021-11-10T20:18:57Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)