Lets Play Music: Audio-driven Performance Video Generation
- URL: http://arxiv.org/abs/2011.02631v1
- Date: Thu, 5 Nov 2020 03:13:46 GMT
- Title: Lets Play Music: Audio-driven Performance Video Generation
- Authors: Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He
- Abstract summary: We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
- Score: 58.77609661515749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new task named Audio-driven Performance Video Generation
(APVG), which aims to synthesize a video of a person playing a certain
instrument, guided by a given music audio clip. Generating high-dimensional,
temporally consistent video from the low-dimensional audio modality is a
challenging task. In this paper, we propose a multi-stage framework for this
new task that generates realistic, synchronized performance video from given
music. First, we provide both global appearance and local spatial information
by generating coarse videos and keypoints of the body and hands from the given
music, respectively. Then, we transform the generated keypoints into heatmaps
via a differentiable space transformer, since heatmaps offer more spatial
information but are harder to generate directly from audio.
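The abstract does not detail the space transformer itself; a common
differentiable keypoint-to-heatmap construction renders an isotropic 2D
Gaussian centered at each keypoint, so that gradients from heatmap-based
losses can flow back into the keypoint generator. Below is a minimal PyTorch
sketch of that idea; the function name, tensor layout, and `sigma` value are
illustrative assumptions, not the paper's implementation.

```python
import torch

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render (B, K, 2) keypoint coordinates as (B, K, H, W) Gaussian heatmaps.

    The rendering is differentiable w.r.t. the keypoint coordinates, so a
    loss on the heatmaps back-propagates to the keypoint generator.
    Coordinates are in pixel units, ordered (x, y). Hypothetical sketch.
    """
    B, K, _ = keypoints.shape
    ys = torch.arange(height, dtype=keypoints.dtype, device=keypoints.device)
    xs = torch.arange(width, dtype=keypoints.dtype, device=keypoints.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")  # each (H, W)

    x = keypoints[..., 0].view(B, K, 1, 1)  # broadcast against the grid
    y = keypoints[..., 1].view(B, K, 1, 1)
    dist_sq = (grid_x - x) ** 2 + (grid_y - y) ** 2
    return torch.exp(-dist_sq / (2.0 * sigma ** 2))  # (B, K, H, W)
```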
Finally, we propose a Structured Temporal UNet (STU) to extract both
intra-frame structured information and inter-frame temporal consistency.
These are obtained via a graph-based structure module and a CNN-GRU based
high-level temporal module, respectively, for final video generation.
Comprehensive experiments validate the effectiveness of our proposed
framework.
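The abstract names the STU modules but not their internals. As a rough
illustration of the CNN-GRU idea only, the sketch below encodes each frame
with a small CNN and aggregates the per-frame features over time with a GRU;
all layer names and sizes are assumptions, and the graph-based structure
module is omitted.

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    """Illustrative CNN-GRU temporal module: per-frame CNN features are
    aggregated across time by a GRU to encourage inter-frame consistency.
    Sizes are placeholder assumptions, not the paper's configuration."""

    def __init__(self, in_channels=3, feat_dim=128, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, C, H, W) -> per-frame features -> temporal states
        B, T, C, H, W = frames.shape
        feats = self.encoder(frames.reshape(B * T, C, H, W)).view(B, T, -1)
        states, _ = self.gru(feats)  # (B, T, hidden_dim)
        return states
```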
Related papers
- VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [68.72384258320743]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music to match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Video Background Music Generation: Dataset, Method and Evaluation [31.15901120245794]
We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation.
We present SymMV, a video and symbolic music dataset with various musical annotations.
We also propose a benchmark video background music generation framework named V-MusProd.
arXiv Detail & Related papers (2022-11-21T08:39:48Z)
- Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
- Audeo: Audio Generation for a Silent Performance Video [17.705770346082023]
We present a novel system that takes as input video frames of a musician playing the piano and generates music for that video.
Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events.
arXiv Detail & Related papers (2020-06-23T00:58:59Z)