Lets Play Music: Audio-driven Performance Video Generation
- URL: http://arxiv.org/abs/2011.02631v1
- Date: Thu, 5 Nov 2020 03:13:46 GMT
- Title: Lets Play Music: Audio-driven Performance Video Generation
- Authors: Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He
- Abstract summary: We propose a new task named Audio-driven Performance Video Generation (APVG)
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
- Score: 58.77609661515749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new task named Audio-driven Performance Video Generation
(APVG), which aims to synthesize the video of a person playing a certain
instrument guided by a given music audio clip. It is a challenging task to
generate high-dimensional, temporally consistent videos from the low-dimensional
audio modality. In this paper, we propose a multi-staged framework to achieve
this new task and generate realistic, synchronized performance video from
given music. Firstly, we provide both global appearance and local spatial
information by generating coarse videos and keypoints of the body and hands
from a given music clip, respectively. Then, we propose to transform the
generated keypoints to heatmaps via a differentiable space transformer, since
heatmaps offer more spatial information but are harder to generate directly
from audio. Finally, we propose a Structured Temporal UNet (STU) to extract
both intra-frame structured information and inter-frame temporal consistency.
They are obtained via a graph-based structure module and a CNN-GRU based
high-level temporal module, respectively, for final video generation.
Comprehensive experiments validate the effectiveness of our proposed framework.
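The keypoint-to-heatmap step can be made differentiable by rendering a Gaussian around each predicted joint using plain tensor operations, so gradients flow back to the audio-predicted coordinates. The sketch below is a minimal illustration of that idea, not the authors' code; the 17-keypoint layout, heatmap size, and sigma are illustrative assumptions.

```python
# Minimal sketch: differentiable keypoint-to-heatmap rendering (illustrative,
# not the paper's space transformer). Gradients reach the (x, y) coordinates.
import torch


def keypoints_to_heatmaps(keypoints: torch.Tensor,
                          height: int = 64,
                          width: int = 64,
                          sigma: float = 2.0) -> torch.Tensor:
    """keypoints: (B, K, 2) pixel coords (x, y) -> heatmaps: (B, K, H, W)."""
    device = keypoints.device
    ys = torch.arange(height, dtype=torch.float32, device=device).view(1, 1, height, 1)
    xs = torch.arange(width, dtype=torch.float32, device=device).view(1, 1, 1, width)
    # Broadcast squared distances from every pixel to every keypoint.
    x = keypoints[..., 0].view(*keypoints.shape[:2], 1, 1)
    y = keypoints[..., 1].view(*keypoints.shape[:2], 1, 1)
    dist2 = (xs - x) ** 2 + (ys - y) ** 2
    return torch.exp(-dist2 / (2.0 * sigma ** 2))


# Usage: gradients propagate through the heatmaps to the keypoint coordinates.
kpts = (torch.rand(1, 17, 2) * 63).requires_grad_()
heatmaps = keypoints_to_heatmaps(kpts)
heatmaps.sum().backward()
print(heatmaps.shape, kpts.grad.shape)  # (1, 17, 64, 64) and (1, 17, 2)
```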
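The CNN-GRU based high-level temporal module can likewise be pictured as a per-frame CNN encoder whose features a GRU aggregates across time. The following is a minimal sketch under assumed channel sizes and frame resolution; it is not the paper's Structured Temporal UNet, and the graph-based structure module is omitted.

```python
# Minimal sketch: per-frame CNN features smoothed over time by a GRU
# (illustrative layer sizes; not the authors' architecture).
import torch
import torch.nn as nn


class CnnGruTemporalModule(nn.Module):
    def __init__(self, in_channels: int = 3, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2, padding=1),  # H -> H/2
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),           # H/2 -> H/4
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                                          # global pooling
        )
        self.proj = nn.Linear(64, feat_dim)
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (B, T, C, H, W) -> temporally aggregated features: (B, T, hidden_dim)."""
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w)).flatten(1)  # (B*T, 64)
        feats = self.proj(feats).reshape(b, t, -1)                       # (B, T, feat_dim)
        out, _ = self.gru(feats)                                         # (B, T, hidden_dim)
        return out


module = CnnGruTemporalModule()
video = torch.randn(2, 8, 3, 64, 64)  # two clips of 8 frames each
print(module(video).shape)            # torch.Size([2, 8, 128])
```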
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation [4.019144083959918]
We present TANGO, a framework for generating co-speech body-gesture videos.
Given a few-minute, single-speaker reference video, TANGO produces high-fidelity videos with synchronized body gestures.
arXiv Detail & Related papers (2024-10-05T16:30:46Z)
- VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [32.741262543860934]
We present a framework for learning to generate background music from video inputs.
We develop a generative video-music Transformer with a novel semantic video-music alignment scheme.
A new temporal video encoder architecture allows us to efficiently process videos consisting of many densely sampled frames.
arXiv Detail & Related papers (2024-09-11T17:56:48Z)
- VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z) - Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive
Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.