Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
- URL: http://arxiv.org/abs/2409.02634v2
- Date: Thu, 5 Sep 2024 09:11:25 GMT
- Title: Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
- Authors: Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, Yanbo Zheng,
- Abstract summary: We propose an end-to-end audio-only conditioned video diffusion model named Loopy.
Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information.
- Score: 15.841490425454344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
Related papers
- Infinite Motion: Extended Motion Generation via Long Text Instructions [51.61117351997808]
"Infinite Motion" is a novel approach that leverages long text to extended motion generation.
Key innovation of our model is its ability to accept arbitrary lengths of text as input.
We incorporate the timestamp design for text which allows precise editing of local segments within the generated sequences.
arXiv Detail & Related papers (2024-07-11T12:33:56Z) - Controllable Longer Image Animation with Diffusion Models [12.565739255499594]
We introduce an open-domain controllable image animation method using motion priors with video diffusion models.
Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos.
We propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks.
arXiv Detail & Related papers (2024-05-27T16:08:00Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - MotionMix: Weakly-Supervised Diffusion for Controllable Motion
Generation [19.999239668765885]
MotionMix is a weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences.
Our framework consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks.
arXiv Detail & Related papers (2024-01-20T04:58:06Z) - MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z) - DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z) - Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints in arbitrary frame rates, and the second stage synthesizes videos based on the given keypoints sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.