READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
- URL: http://arxiv.org/abs/2508.03457v2
- Date: Wed, 06 Aug 2025 11:50:17 GMT
- Title: READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
- Authors: Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Jianqing Gao, Qingfeng Liu
- Abstract summary: We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
- Score: 55.58089937219475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporally highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.
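As a rough illustration of the asynchronous noise scheduler described in the abstract, the sketch below (not the authors' code; the module names, tensor shapes, linear per-frame noise ramp, and flow-style interpolation are assumptions) shows how per-frame noise levels can be assigned within a compressed latent clip, and how lightly noised frames from the previous clip can act as motion guidance for the next one.

```python
# Minimal conceptual sketch of the two ANS ideas named in the abstract:
# (1) asynchronous add-noise: frames within a latent clip receive different
#     per-frame noise levels, and
# (2) asynchronous motion-guided generation: lightly noised tail frames of the
#     previous clip anchor the next clip.
# Shapes, the linear per-frame ramp, and the interpolation schedule are assumptions.

import torch

def asynchronous_add_noise(x0: torch.Tensor, t_first: float, t_last: float):
    """Noise a clip of compressed video latents with per-frame timesteps.

    x0: (batch, frames, channels) latents from the temporal VAE.
    t_first / t_last: noise level for the first / last frame; intermediate
    frames get linearly interpolated levels, so the clip is noised asynchronously.
    """
    b, f, c = x0.shape
    t = torch.linspace(t_first, t_last, f).view(1, f, 1).expand(b, f, 1)
    noise = torch.randn_like(x0)
    xt = (1.0 - t) * x0 + t * noise  # simple linear interpolation schedule
    return xt, noise, t

def build_clip_input(prev_clip: torch.Tensor, new_frames: int):
    """Asynchronous motion guidance: keep the tail of the previous clip at a
    low noise level while the new frames start from pure noise."""
    guided, _, _ = asynchronous_add_noise(prev_clip[:, -2:], t_first=0.1, t_last=0.2)
    b, _, c = prev_clip.shape
    fresh = torch.randn(b, new_frames, c)  # new frames: full noise
    return torch.cat([guided, fresh], dim=1)

# Toy usage: 4-frame clips of 16-channel latents.
x0 = torch.randn(1, 4, 16)
xt, noise, t = asynchronous_add_noise(x0, t_first=0.2, t_last=0.9)
clip_in = build_clip_input(prev_clip=x0, new_frames=4)
print(xt.shape, clip_in.shape)  # torch.Size([1, 4, 16]) torch.Size([1, 6, 16])
```

In the actual framework the anchored overlap frames would be denoised jointly with the new frames by the A2V-DiT conditioned on SpeechAE latents; the sketch only illustrates the asynchronous noise assignment itself.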
Related papers
- MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation [21.216297567167036]
MirrorMe is a real-time, controllable framework built on the LTX video model. MirrorMe compresses video spatially and temporally for efficient latent-space denoising. Experiments on the EMTD benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
arXiv Detail & Related papers (2025-06-27T09:57:23Z) - Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame by frame. Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z) - Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding [21.682444278458433]
Speech Speculative Decoding (SSD) is a novel framework for accelerating autoregressive speech synthesis. SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding.
arXiv Detail & Related papers (2025-05-21T11:17:04Z) - Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion [0.881371061335494]
We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation. RDLA restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup.
arXiv Detail & Related papers (2025-03-13T15:54:45Z) - Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis [27.43583075023949]
Ditto is a diffusion-based talking head framework that enables fine-grained controls and real-time inference. We show that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.
arXiv Detail & Related papers (2024-11-29T07:01:31Z) - Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation.
The first stage involves generating synchronized facial landmarks based on the given speech.
In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to mitigate mouth jitter and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
arXiv Detail & Related papers (2024-08-03T10:19:38Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, a training-free flow-guided recurrent latent propagation module enhances overall video stability by propagating and fusing latents across the entire sequence.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, leading to an increased complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z) - Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z)