StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
- URL: http://arxiv.org/abs/2512.22065v1
- Date: Fri, 26 Dec 2025 15:41:24 GMT
- Title: StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
- Authors: Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, Yong-Jin Liu,
- Abstract summary: We propose a two-stage autoregressive adaptation and acceleration framework to adapt a high-fidelity human video diffusion model for real-time, interactive streaming.<n>We develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures.<n>Our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.
- Score: 32.75338796722652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io .
Related papers
- HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation [55.73037290387896]
We introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion.<n>First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions.<n>Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past human history, inter-person dependencies, and align with text guidance.
arXiv Detail & Related papers (2026-01-28T08:47:23Z) - FlowAct-R1: Towards Interactive Humanoid Video Generation [37.04996721172613]
FlowAct-R1 is a framework specifically designed for real-time interactive humanoid video generation.<n>Our framework achieves a stable 25 fps at 480p resolution with a time-to-first-frame (F) of only around 1.5 seconds.
arXiv Detail & Related papers (2026-01-15T06:16:22Z) - Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation [71.38488610271247]
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation.<n>Current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement.<n>We propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing.
arXiv Detail & Related papers (2026-01-02T11:58:48Z) - Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models [80.28579390566298]
We introduce Interact2Ar, a text-conditioned autoregressive diffusion model for generating full-body, human-human interactions.<n>Hand kinematics are incorporated through dedicated parallel branches, enabling high-fidelity full-body generation.<n>Our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios.
arXiv Detail & Related papers (2025-12-22T18:59:50Z) - Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length [57.458450695137664]
We present Live Avatar, an algorithm-system co-designed framework for efficient, high-fidelity, and infinite-length avatar generation.<n>Live Avatar is first to achieve practical, real-time, high-fidelity avatar generation at this scale.
arXiv Detail & Related papers (2025-12-04T11:11:24Z) - AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [71.90109867684025]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.<n>We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.<n>AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z) - REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning [95.07708090428814]
We present REWIND, a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs.<n>We introduce cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions.<n>We also propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality.
arXiv Detail & Related papers (2025-04-07T11:44:11Z) - Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation [5.78796187123888]
We introduce a method to generate temporally coherent human animation from a single image, a video, or a random noise.
We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance.
arXiv Detail & Related papers (2023-07-02T13:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.