SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads
- URL: http://arxiv.org/abs/2602.07449v3
- Date: Wed, 11 Feb 2026 12:34:52 GMT
- Title: SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads
- Authors: Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu,
- Abstract summary: We propose SoulX-FlashHead, a unified framework for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training.
- Score: 19.531644258572353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
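To make the Temporal Audio Context Cache idea concrete, the sketch below shows one plausible realization: cached past waveform samples are prepended to each short incoming fragment before feature extraction, and only the features aligned with the new fragment are emitted. This is a minimal illustration under assumed parameters; the class name, method names, and toy encoder are hypothetical, not the paper's actual API.

```python
import torch

# Minimal sketch of a temporal audio context cache (illustrative only; the
# names below are assumptions, not the paper's API). Streaming audio arrives
# in short fragments; feature extractors are unstable on very short inputs,
# so cached past samples are prepended before extraction and only the
# features aligned with the new fragment are kept.

class TemporalAudioContextCache:
    def __init__(self, context_samples: int, samples_per_feature: int):
        self.context_samples = context_samples          # history kept, in samples
        self.samples_per_feature = samples_per_feature  # encoder stride
        self.buffer = torch.zeros(0)                    # cached past waveform

    def extract(self, fragment: torch.Tensor, encoder) -> torch.Tensor:
        # Prepend cached context so the encoder sees a longer waveform.
        padded = torch.cat([self.buffer, fragment])
        feats = encoder(padded)                         # (T_feat, D)
        # Keep only the features that correspond to the new fragment.
        new_feats = fragment.numel() // self.samples_per_feature
        # Update the cache with the most recent samples.
        self.buffer = padded[-self.context_samples:]
        return feats[-new_feats:]

# Toy causal "encoder": mean-pools non-overlapping windows of 320 samples.
def toy_encoder(wave: torch.Tensor) -> torch.Tensor:
    usable = wave.numel() - wave.numel() % 320
    return wave[:usable].reshape(-1, 320).mean(dim=1, keepdim=True)

cache = TemporalAudioContextCache(context_samples=16000, samples_per_feature=320)
for _ in range(3):                                      # three 200 ms fragments at 16 kHz
    frag = torch.randn(3200)
    feats = cache.extract(frag, toy_encoder)
    print(feats.shape)                                  # torch.Size([10, 1])
```

The key property is that each fragment is encoded with growing (bounded) left context, so feature statistics stay stable even though the stream is consumed in short pieces.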
Related papers
- SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation [16.34443339642213]
SoulX-FlashTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS.
arXiv Detail & Related papers (2025-12-29T11:18:24Z) - Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length [57.458450695137664]
We present Live Avatar, an algorithm-system co-designed framework for efficient, high-fidelity, and infinite-length avatar generation. Live Avatar is the first to achieve practical, real-time, high-fidelity avatar generation at this scale.
arXiv Detail & Related papers (2025-12-04T11:11:24Z) - Rolling Forcing: Autoregressive Long Video Diffusion in Real Time [86.40480237741609]
Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. It comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon streaming video generation task, which allows the model to keep the key-value states of the initial frames as a global context anchor. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
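The attention-sink design can be illustrated with a simple KV cache that pins the states of the first few frames as a global anchor while maintaining a rolling window over recent frames. This is a generic sketch of the mechanism, not Rolling Forcing's implementation; the parameter names and sizes are hypothetical.

```python
import torch

# Illustrative attention-sink KV cache for long-horizon streaming generation
# (assumed structure; sink and window sizes are hypothetical). The KV states
# of the first `sink_frames` frames are never evicted, while the remaining
# slots form a rolling window over the most recent frames.

class SinkKVCache:
    def __init__(self, sink_frames: int, window_frames: int):
        self.sink_frames = sink_frames
        self.window_frames = window_frames
        self.keys, self.values = [], []   # one (tokens, D) tensor per frame

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-sink frame once the rolling window is full.
        if len(self.keys) > self.sink_frames + self.window_frames:
            del self.keys[self.sink_frames]
            del self.values[self.sink_frames]

    def kv(self):
        return torch.cat(self.keys), torch.cat(self.values)

cache = SinkKVCache(sink_frames=2, window_frames=4)
for t in range(10):                       # stream 10 frames of 16 tokens each
    cache.append(torch.randn(16, 64), torch.randn(16, 64))
k, v = cache.kv()
print(k.shape)                            # torch.Size([96, 64]): (2 sink + 4 recent) * 16
```

Pinning the initial frames keeps a stable global reference in attention no matter how far the stream runs, which is what counters drift over long horizons.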
arXiv Detail & Related papers (2025-09-29T17:57:14Z) - LongLive: Real-time Interactive Long Video Generation [68.45945318075432]
LongLive is a frame-level autoregressive framework for real-time and interactive long video generation. It sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos.
arXiv Detail & Related papers (2025-09-26T17:48:24Z) - Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation [50.04968365065964]
Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference. We introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), and we also propose Decoupled Foreground Attention (DFA) to further accelerate attention computations.
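Caching-based acceleration of this kind can be sketched generically: reuse an expensive block's output across adjacent denoising steps whenever its input has barely changed. The wrapper below illustrates that general pattern only; it is not LightningCP's actual algorithm, and the tolerance and interface are hypothetical.

```python
import torch

# Generic sketch of caching across denoising steps (illustrative, not
# LightningCP). An expensive sub-network's output is reused whenever its
# input has moved less than a tolerance since the previous step.

class CachedBlock:
    def __init__(self, block, tol: float = 1e-2):
        self.block, self.tol = block, tol
        self.last_in = None
        self.last_out = None
        self.hits = 0

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Reuse the cached output when the input barely changed.
        if self.last_in is not None and (x - self.last_in).abs().mean() < self.tol:
            self.hits += 1
            return self.last_out
        self.last_in, self.last_out = x, self.block(x)
        return self.last_out

expensive = torch.nn.Linear(64, 64)       # stand-in for a heavy transformer block
cached = CachedBlock(expensive, tol=0.05)

x = torch.randn(1, 64)
for step in range(20):                    # simulated denoising trajectory
    x = x + 0.001 * torch.randn(1, 64)    # input drifts slowly between steps
    with torch.no_grad():
        y = cached(x)
print(f"cache hits: {cached.hits} / 20")  # most steps reuse the cached output
```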
arXiv Detail & Related papers (2025-08-25T02:58:39Z) - StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation [91.45910771331741]
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing.
arXiv Detail & Related papers (2025-08-11T17:58:24Z) - READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for generation. We show that READ outperforms state-of-the-art methods, generating competitive talking head videos with significantly reduced runtime.
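A back-of-the-envelope calculation shows why a compressed latent space matters for real-time diffusion. The compression factors below are hypothetical placeholders, not READ's reported numbers; the point is only the order-of-magnitude drop in tokens the transformer must process.

```python
# Back-of-the-envelope token count under assumed VAE compression ratios
# (the 8x spatial / 4x temporal / 2x patchify factors are illustrative,
# not READ's reported configuration).

frames, height, width = 16, 512, 512
spatial_ds, temporal_ds, patch = 8, 4, 2

pixel_tokens = frames * (height // patch) * (width // patch)
latent_tokens = (frames // temporal_ds) * (height // (spatial_ds * patch)) \
                * (width // (spatial_ds * patch))
print(pixel_tokens, latent_tokens, pixel_tokens // latent_tokens)
# 1048576 4096 256 -> a 256x reduction in sequence length for the transformer
```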
arXiv Detail & Related papers (2025-08-05T13:57:03Z) - LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models [17.858801012726445]
Diffusion-based models have gained wide adoption in virtual human generation due to their outstanding expressiveness. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with initial video generation latencies of 140 ms and 215 ms, respectively.
arXiv Detail & Related papers (2025-06-06T07:09:07Z) - CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes [31.783117836434403]
CD-NGP is a continual learning framework that reduces memory overhead and enhances scalability. It significantly reduces training memory usage to 14 GB and requires only 0.4 MB per frame of streaming bandwidth on DyNeRF.
arXiv Detail & Related papers (2024-09-08T17:35:48Z)