Every Image Listens, Every Image Dances: Music-Driven Image Animation
- URL: http://arxiv.org/abs/2501.18801v1
- Date: Thu, 30 Jan 2025 23:38:51 GMT
- Title: Every Image Listens, Every Image Dances: Music-Driven Image Animation
- Authors: Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak
- Abstract summary: MuseDance is an end-to-end model that animates reference images using both music and text inputs.
Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences.
We present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions.
- Score: 8.085267959520843
- Abstract: Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new baseline for the music-driven image animation task.
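The abstract describes a diffusion-based generator conditioned jointly on music, text, and a reference image, with no pose or depth guidance. Below is a minimal, hypothetical PyTorch sketch of such a conditioned sampling loop; the module names, embedding sizes, and the plain DDPM scheduler are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of music- and text-conditioned latent diffusion sampling,
# in the spirit of MuseDance. Shapes, modules, and scheduler are assumptions.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts noise from a frame latent plus fused conditioning."""
    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        # Fuse music, text, and reference-image embeddings into one condition vector.
        self.cond_proj = nn.Linear(cond_dim * 3, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, latent, music_emb, text_emb, ref_emb):
        cond = self.cond_proj(torch.cat([music_emb, text_emb, ref_emb], dim=-1))
        return self.net(torch.cat([latent, cond], dim=-1))  # predicted noise

@torch.no_grad()
def sample(denoiser, music_emb, text_emb, ref_emb, frames=16, latent_dim=64, steps=50):
    """Plain DDPM-style ancestral sampling over per-frame latents."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(frames, latent_dim)  # one latent per output frame
    for t in reversed(range(steps)):
        eps = denoiser(x, music_emb.expand(frames, -1),
                       text_emb.expand(frames, -1), ref_emb.expand(frames, -1))
        a_t, ac_t = 1.0 - betas[t], alphas_cum[t]
        x = (x - betas[t] / torch.sqrt(1.0 - ac_t) * eps) / torch.sqrt(a_t)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

denoiser = ConditionedDenoiser()
music_emb, text_emb, ref_emb = (torch.randn(1, 128) for _ in range(3))
video_latents = sample(denoiser, music_emb, text_emb, ref_emb)
print(video_latents.shape)  # torch.Size([16, 64])
```

The sketch only shows how the three conditioning signals enter the denoising step; in a real system the per-frame latents would come from a spatio-temporal denoiser with temporal attention and be decoded to video frames by a separate decoder.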
Related papers
- One-Shot Learning Meets Depth Diffusion in Multi-Object Videos [0.0]
This paper introduces a novel depth-conditioning approach that enables the generation of coherent and diverse videos from just a single text-video pair.
Our method fine-tunes the pre-trained model to capture continuous motion by employing custom-designed spatial and temporal attention mechanisms.
During inference, we use the DDIM inversion to provide structural guidance for video generation.
arXiv Detail & Related papers (2024-08-29T16:58:10Z) - Dance Any Beat: Blending Beats with Visuals in Dance Video Generation [12.018432669719742]
We introduce a novel task: generating dance videos directly from images of individuals guided by music.
Our solution, the Dance Any Beat Diffusion model (DabFusion), utilizes a reference image and a music piece to generate dance videos.
We evaluate DabFusion's performance using the AIST++ dataset, focusing on video quality, audio-video synchronization, and motion-music alignment.
arXiv Detail & Related papers (2024-05-15T11:33:07Z) - AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video (T2V) diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
arXiv Detail & Related papers (2023-12-06T13:39:35Z) - Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z) - Make Pixels Dance: High-Dynamic Video Generation [13.944607760918997]
State-of-the-art video generation methods tend to produce video clips with minimal motions despite maintaining high fidelity.
We introduce PixelDance, a novel approach that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation.
arXiv Detail & Related papers (2023-11-18T06:25:58Z) - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [63.43133768897087]
We propose a method to convert open-domain images into animated videos.
The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance.
Our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image.
arXiv Detail & Related papers (2023-10-18T14:42:16Z) - TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration [75.37311932218773]
We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities.
Our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities.
arXiv Detail & Related papers (2023-04-05T12:58:33Z) - BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis [123.73677487809418]
We introduce a new dataset aiming to challenge common assumptions in dance motion synthesis.
We focus on breakdancing which features acrobatic moves and tangled postures.
Our efforts produced the BRACE dataset, which contains over 3 hours and 30 minutes of densely annotated poses.
arXiv Detail & Related papers (2022-07-20T18:03:54Z) - MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning [51.78302763617991]
Dancing video retargeting aims to synthesize a video that transfers the dance movements from a source video to a target person.
Previous work needs to collect a several-minute-long video of a target person, containing thousands of frames, to train a personalized model.
Recent work tackles few-shot dancing video retargeting, which learns to synthesize videos of unseen persons by leveraging only a few frames of them.
arXiv Detail & Related papers (2022-01-13T09:34:20Z) - Learning to Generate Diverse Dance Motions with Transformer [67.43270523386185]
We introduce a complete system for dance motion synthesis.
A massive dance motion data set is created from YouTube videos.
A novel two-stream motion transformer generative model can generate motion sequences with high flexibility.
arXiv Detail & Related papers (2020-08-18T22:29:40Z)