Related papers: X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

URL: http://arxiv.org/abs/2508.02944v1
Date: Mon, 04 Aug 2025 22:57:01 GMT
Title: X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio
Authors: Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao, Tianpei Gu, Guoxian Song, Xin Chen, Chao Liang, Jianwen Jiang, Linjie Luo,
Abstract summary: X-Actor generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip.<n>By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics.<n>X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations.
Score: 27.619816538121327
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

Related papers

Shushing! Let's Imagine an Authentic Speech from the Silent Video [15.426152742881365]
Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals.<n>Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues.<n>We introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input.
arXiv Detail & Related papers (2025-03-19T06:28:17Z)
SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input.<n>Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image.<n>It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv Detail & Related papers (2025-01-17T08:10:53Z)
GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression [33.886734972316326]
GoHD is a framework designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion.<n>An animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles.<n>A conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody.<n>A two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions.
arXiv Detail & Related papers (2024-12-12T14:12:07Z)
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos.<n>MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation. We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement [8.973545189395953]
This study focuses on the creation of visually compelling, time-synchronized animations through diffusion-based techniques. We process audio features separately and derive the corresponding control gates, which implicitly govern the movements in the mouth, eyes, and head, irrespective of the portrait's origin. The significant improvements in the fidelity of animated portraits, the accuracy of lip-syncing, and the appropriate motion variations achieved by our method render it a versatile tool for animating any portrait in any language.
arXiv Detail & Related papers (2024-07-26T08:30:06Z)
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations. Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently. Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style. Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [28.157431757281692]
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions. Our framework consists of a speaker-independent stage and a speaker-specific stage. Our algorithm achieves high-quality photo-realistic talking-head videos including various facial expressions and head motions according to speech rhythms.
arXiv Detail & Related papers (2021-04-16T09:44:12Z)
Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audios. Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces. With the disentangled features, dynamic 2D emotional facial landmarks can be deduced. Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.