AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective
- URL: http://arxiv.org/abs/2509.12052v2
- Date: Thu, 16 Oct 2025 16:37:59 GMT
- Title: AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective
- Authors: Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han
- Abstract summary: AvatarSync is an autoregressive framework built on phoneme representations that generates realistic talking-head animations from a single reference image. We show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency.
- Score: 15.69417162113696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking-head animation focuses on generating realistic facial videos from audio input. Following Generative Adversarial Networks (GANs), diffusion models have become mainstream, owing to their robust generative capacity. However, inherent limitations of the diffusion process often lead to inter-frame flicker and slow inference, restricting their practical deployment. To address this, we introduce AvatarSync, an autoregressive framework built on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. To mitigate flicker and ensure continuity, AvatarSync leverages an autoregressive pipeline that strengthens temporal modeling. To ensure controllability, we introduce phonemes, the basic units of speech sounds, and construct a many-to-one mapping from text/audio to phonemes, enabling precise phoneme-to-visual alignment. Finally, to accelerate inference, we adopt a two-stage generation strategy that decouples semantic modeling from visual dynamics, and incorporate a customized Phoneme-Frame Causal Attention Mask to support multi-step parallel acceleration. Extensive experiments on both Chinese (CMLR) and English (HDTF) datasets demonstrate that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.
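The abstract's most concrete mechanism is the Phoneme-Frame Causal Attention Mask, which restricts how generated frames attend to phoneme tokens and to each other. The abstract does not spell out the construction, so the following is a minimal NumPy sketch under one plausible reading: each frame attends to its aligned phoneme and all earlier phonemes, plus all frames aligned to strictly earlier phonemes. The function name and the `frame_to_phoneme` alignment encoding are hypothetical, not taken from the paper.

```python
import numpy as np

def phoneme_frame_causal_mask(frame_to_phoneme, num_phonemes):
    """Build a hypothetical phoneme-frame causal attention mask.

    frame_to_phoneme maps each frame index to its aligned phoneme index
    (monotonically non-decreasing, reflecting a many-to-one
    phoneme-to-frame alignment). Returns a boolean matrix of shape
    (num_frames, num_phonemes + num_frames); True means "may attend".
    """
    num_frames = len(frame_to_phoneme)
    mask = np.zeros((num_frames, num_phonemes + num_frames), dtype=bool)
    for f, p in enumerate(frame_to_phoneme):
        # Attend to the aligned phoneme and all earlier phonemes.
        mask[f, : p + 1] = True
        # Attend to frames of strictly earlier phonemes, plus itself.
        for g, q in enumerate(frame_to_phoneme):
            if q < p or g == f:
                mask[f, num_phonemes + g] = True
    return mask

# Toy alignment: 5 frames cover 3 phonemes.
print(phoneme_frame_causal_mask([0, 0, 1, 2, 2], num_phonemes=3).astype(int))
```

Under this reading, frames that share a phoneme see identical context and are independent of one another, so they can be decoded in a single parallel step, which would account for the multi-step parallel acceleration the abstract claims.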
Related papers
- JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning [18.72712280434528]
JoyAvatar is a framework capable of generating long-duration avatar videos. We introduce a twin-teacher-enhanced training algorithm that enables the model to transfer inherent text controllability. During training, we dynamically modulate the strength of multi-modal conditions.
arXiv Detail & Related papers (2026-01-31T13:00:57Z) - PersonaLive! Expressive Portrait Image Animation for Live Streaming [53.63615310186964]
PersonaLive is a novel diffusion-based framework for streaming real-time portrait animation. We first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to a 7-22x speedup over prior diffusion-based portrait animation models.
arXiv Detail & Related papers (2025-12-12T03:24:40Z) - TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model [18.910745982208965]
TalkingPose is a novel diffusion-based framework for producing temporally consistent human upper-body animations. We introduce a feedback-driven mechanism built upon image-based diffusion models to ensure continuous motion and enhance temporal coherence. We also introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
arXiv Detail & Related papers (2025-11-30T14:26:24Z) - MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control [48.94486508604052]
MAGIC-Talk is a one-shot diffusion-based framework for customizable talking face generation. ReferenceNet preserves identity and enables fine-grained facial editing via text prompts. AnimateNet enhances motion coherence using structured motion priors.
arXiv Detail & Related papers (2025-10-26T19:49:31Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation [69.50178144839275]
Singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics. Existing speech-driven approaches often produce oversimplified, emotionally flat, and semantically inconsistent results. Think2Sing generates semantically coherent and temporally consistent 3D head animations conditioned on both lyrics and acoustics.
arXiv Detail & Related papers (2025-09-02T12:59:27Z) - InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference frames to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long-sequence dubbing. Comprehensive evaluations on the HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z) - X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio [27.619816538121327]
X-Actor generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics. X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations.
arXiv Detail & Related papers (2025-08-04T22:57:01Z) - Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z) - DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation [13.089363781114477]
DiTalker is a unified DiT-based framework for speaking-style-controllable portrait animation. We introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers (a minimal sketch of this decoupling pattern appears after this list). Experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability.
arXiv Detail & Related papers (2025-07-29T08:23:56Z) - Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer from a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z) - AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z) - OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking [22.337906095079198]
We present OmniTalker, a unified framework that jointly generates synchronized talking audio-video content from input text. Our framework adopts a dual-branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis.
arXiv Detail & Related papers (2025-04-03T09:48:13Z) - PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [34.43272121705662]
We introduce a novel, customizable one-shot audio-driven talking face generation framework named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. The key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms (see the sketch after this list).
arXiv Detail & Related papers (2024-12-10T18:51:31Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement [8.973545189395953]
This study focuses on the creation of visually compelling, time-synchronized animations through diffusion-based techniques.
We process audio features separately and derive the corresponding control gates, which implicitly govern the movements in the mouth, eyes, and head, irrespective of the portrait's origin.
The significant improvements in the fidelity of animated portraits, the accuracy of lip-syncing, and the appropriate motion variations achieved by our method render it a versatile tool for animating any portrait in any language.
arXiv Detail & Related papers (2024-07-26T08:30:06Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)