JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion
- URL: http://arxiv.org/abs/2512.11423v1
- Date: Fri, 12 Dec 2025 10:06:01 GMT
- Title: JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion
- Authors: Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He
- Abstract summary: JoyAvatar is an audio-driven autoregressive model capable of real-time inference and infinite-length video generation. Our model achieves competitive results in visual quality, temporal consistency, and lip synchronization.
- Score: 19.420963062956222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
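The abstract states only that Progressive Step Bootstrapping gives the initial frames more denoising steps than later ones; it does not publish the schedule. The sketch below illustrates that idea under loud assumptions: the function name `progressive_step_schedule`, the linear decay, and the values `first_steps=8`, `min_steps=2`, and `decay=2` are all hypothetical and are not taken from the paper.

```python
from typing import List


def progressive_step_schedule(
    num_blocks: int,
    first_steps: int = 8,  # assumed step count for the first block
    min_steps: int = 2,    # assumed floor for later blocks
    decay: int = 2,        # assumed per-block reduction
) -> List[int]:
    """Return a per-block denoising step count that starts high and decays.

    Hypothetical illustration only: early blocks get more steps so that
    later blocks are conditioned on cleaner context.
    """
    if num_blocks < 0:
        raise ValueError("num_blocks must be non-negative")
    if min_steps < 1 or first_steps < min_steps:
        raise ValueError("require first_steps >= min_steps >= 1")
    # Linear decay, clamped at min_steps.
    return [max(min_steps, first_steps - decay * i) for i in range(num_blocks)]


if __name__ == "__main__":
    schedule = progressive_step_schedule(6)
    print(schedule)  # [8, 6, 4, 2, 2, 2]
```

With these assumed values the first three blocks account for 18 of the 24 total steps, so the extra cost is paid once at the start and the steady-state cost per block stays at the floor.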
Related papers
- EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation [8.795438456031512]
Multi-modal generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal ambiguities, such as blurring, temporal drift, and lip desynchronization. We propose EchoTorrent, a novel framework with a fourfold schema: Multi-Teacher Training, which fine-tunes a pre-trained model on distinct preference domains; Adaptive DMD (ACCDMD), which calibrates the audio CFG degradation errors in phases via a schedule; Long Hybrid Tail, which enforces alignment exclusively on tail frames during long-horizon self-roll
arXiv Detail & Related papers (2026-02-14T08:32:38Z)
- Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length [57.458450695137664]
We present Live Avatar, an algorithm-system co-designed framework for efficient, high-fidelity, and infinite-length avatar generation. Live Avatar is the first to achieve practical, real-time, high-fidelity avatar generation at this scale.
arXiv Detail & Related papers (2025-12-04T11:11:24Z)
- Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models [11.913945404405865]
Most video diffusion models (VDMs) generate videos in an autoregressive manner, generating subsequent frames conditioned on previous ones. We propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs.
arXiv Detail & Related papers (2025-11-15T08:29:14Z)
- StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time [86.40480237741609]
Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
arXiv Detail & Related papers (2025-09-29T17:57:14Z)
- POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models [18.761042377485367]
POSE (Phased One-Step Equilibrium) is a distillation framework that reduces the sampling steps of large-scale video diffusion models. We show that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality.
arXiv Detail & Related papers (2025-08-28T17:20:01Z)
- StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation [91.45910771331741]
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing.
arXiv Detail & Related papers (2025-08-11T17:58:24Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.