Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
- URL: http://arxiv.org/abs/2503.18429v1
- Date: Mon, 24 Mar 2025 08:16:47 GMT
- Title: Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
- Authors: Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, Ming Tao,
- Abstract summary: We introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a, talking head.<n>We propose Teller, the first streaming audio-driven protrait animation framework with autoregressive motion generation.
- Score: 18.45773436423025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a, talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven protrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transfromer, and movement authenticity refinement using a Efficient Temporal Module (ETM).Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR tranformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporate ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second video generation), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.
Related papers
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics.
In the first stage, we employ a clip-level training scheme to establish coherent global motion.
In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z) - ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model [41.35209566957009]
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips.<n>We introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks.
arXiv Detail & Related papers (2025-02-27T17:49:01Z) - JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation [10.003794924759765]
JoyVASA is a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation.<n>We introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations.<n>In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity.
arXiv Detail & Related papers (2024-11-14T06:13:05Z) - MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - MagicAnimate: Temporally Consistent Human Image Animation using
Diffusion Model [74.84435399451573]
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity.
arXiv Detail & Related papers (2023-11-27T18:32:31Z) - AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models
without Specific Tuning [92.33690050667475]
AnimateDiff is a framework for animating personalized T2I models without requiring model-specific tuning.
We propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns.
Results show that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity.
arXiv Detail & Related papers (2023-07-10T17:34:16Z) - CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness.
We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook.
We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-01-06T05:04:32Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality
Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eye brow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z) - Real-Time Cleaning and Refinement of Facial Animation Signals [0.0]
We propose a real-time animation refining system that preserves -- or even restores -- the natural dynamics of facial motions.
We leverage an off-the-shelf recurrent neural network architecture that learns proper facial dynamics patterns on clean animation data.
arXiv Detail & Related papers (2020-08-04T05:21:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.