PersonaLive! Expressive Portrait Image Animation for Live Streaming
- URL: http://arxiv.org/abs/2512.11253v1
- Date: Fri, 12 Dec 2025 03:24:40 GMT
- Title: PersonaLive! Expressive Portrait Image Animation for Live Streaming
- Authors: Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun
- Abstract summary: PersonaLive is a diffusion-based framework for streaming real-time portrait animation. We first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to a 7-22x speedup over prior diffusion-based portrait animation models.
- Score: 53.63615310186964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current diffusion-based portrait animation models focus predominantly on enhancing visual quality and expression realism while overlooking generation latency and real-time performance, which restricts their applicability to live-streaming scenarios. We propose PersonaLive, a diffusion-based framework for streaming real-time portrait animation with a multi-stage training recipe. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a few-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm, equipped with a sliding training strategy and a historical keyframe mechanism, to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to a 7-22x speedup over prior diffusion-based portrait animation models.
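The autoregressive micro-chunk streaming paradigm described in the abstract can be pictured as a loop that denoises a few frames at a time while conditioning on a sliding window of recent output and a historical keyframe. The Python sketch below is purely illustrative: the `denoiser.generate` interface, the chunk and context sizes, and all argument names are assumptions made for exposition, not the PersonaLive implementation.

```python
# Hypothetical sketch of an autoregressive micro-chunk streaming loop with a
# sliding context window and a historical keyframe. All names and sizes are
# invented for illustration; they do not correspond to a released API.
from collections import deque

CHUNK_SIZE = 4        # frames generated per micro-chunk (assumed value)
CONTEXT_FRAMES = 8    # recent frames kept as autoregressive context (assumed)

def stream_animation(denoiser, reference_image, motion_signal_stream):
    """Yield animated frames micro-chunk by micro-chunk as motion signals arrive."""
    context = deque(maxlen=CONTEXT_FRAMES)   # sliding window of recent frames
    keyframe = reference_image               # historical keyframe anchor

    for motion_chunk in motion_signal_stream:   # e.g. hybrid implicit signals
        # Condition each micro-chunk on the keyframe (long-term identity and
        # appearance) and on the sliding context (short-term temporal continuity).
        frames = denoiser.generate(
            appearance=keyframe,
            context=list(context),
            motion=motion_chunk,
            num_frames=CHUNK_SIZE,
        )
        context.extend(frames)                  # slide the context window forward
        for frame in frames:                    # emit frames with low latency
            yield frame
```

In this sketch, latency stays low because frames are emitted as soon as each micro-chunk is denoised, while the historical keyframe guards against identity and appearance drift over long generations.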
Related papers
- TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model [18.910745982208965]
TalkingPose is a novel diffusion-based framework for producing temporally consistent human upper-body animations. We introduce a feedback-driven mechanism built upon image-based diffusion models to ensure continuous motion and enhance temporal coherence. We also introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
arXiv Detail & Related papers (2025-11-30T14:26:24Z) - StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model [73.30619724574642]
Speech-driven 3D facial animation aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation. We propose a novel autoregressive diffusion model that processes audio in a streaming manner.
arXiv Detail & Related papers (2025-11-18T07:55:16Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - Stable Video-Driven Portraits [52.008400639227034]
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
arXiv Detail & Related papers (2025-09-22T08:11:08Z) - Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation [72.20148916920944]
Follow-Your-Emoji-Faster is an efficient diffusion-based framework for portrait animation driven by facial landmarks. Our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. EmojiBench++ is a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences.
arXiv Detail & Related papers (2025-09-20T11:09:01Z) - X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio [27.619816538121327]
X-Actor generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics. X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations.
arXiv Detail & Related papers (2025-08-04T22:57:01Z) - EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation [58.41979933166173]
EvAnimate is the first method leveraging event streams as robust and precise motion cues for conditional human image animation. High-quality and temporally coherent animations are achieved through a dual-branch architecture. Experimental results show EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
arXiv Detail & Related papers (2025-03-24T11:05:41Z) - JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation [10.003794924759765]
JoyVASA is a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. We introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity.
arXiv Detail & Related papers (2024-11-14T06:13:05Z)