Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation
- URL: http://arxiv.org/abs/2510.02617v1
- Date: Thu, 02 Oct 2025 23:35:52 GMT
- Title: Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation
- Authors: Beijia Lu, Ziyi Chen, Jing Xiao, Jun-Yan Zhu
- Abstract summary: Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. In this work, we distill a many-step diffusion video model into a few-step student model. We propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker's face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence.
- Score: 39.27933931527444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker's face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.
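The core mechanism admits a short illustration. The sketch below shows one way 2D pose keypoints could gate attention to the speaker's face, hands, and upper body and re-weight a distillation loss toward those regions. It is a minimal reconstruction from the abstract, not the authors' implementation; the function names, the token-grid assumption, the `radius` and `w_region` parameters, and the loss form are all assumptions.

```python
# Minimal sketch (not the paper's released code), assuming video latents lie on an
# (H, W) token grid per frame and 2D pose keypoints are given in pixel coordinates.
# Tokens near any keypoint (face, hands, upper body) form the "relevant" set; both
# attention and the distillation loss are biased toward that set.
import torch
import torch.nn.functional as F

def keypoint_token_mask(keypoints, grid_hw, image_hw, radius=2):
    """Boolean (H*W,) mask flagging latent tokens within `radius` grid cells of any keypoint."""
    H, W = grid_hw
    ih, iw = image_hw
    ys = torch.arange(H).view(H, 1)
    xs = torch.arange(W).view(1, W)
    mask = torch.zeros(H, W, dtype=torch.bool)
    for x, y in keypoints:  # keypoints given as (x, y) pixel coordinates
        gy, gx = int(y / ih * H), int(x / iw * W)
        mask |= (ys - gy).abs().le(radius) & (xs - gx).abs().le(radius)
    return mask.flatten()

def input_aware_sparse_attention(q, k, v, token_mask):
    """Restrict attention so every query attends only to keypoint-region tokens (plus itself).
    q, k, v: (B, heads, N, D); token_mask: (N,) bool, True entries are kept."""
    N = q.shape[-2]
    allow = token_mask.view(1, 1, 1, N).expand(1, 1, N, N).clone()
    allow |= torch.eye(N, dtype=torch.bool, device=q.device).view(1, 1, N, N)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=allow)

def input_aware_distill_loss(student_latents, teacher_latents, token_mask, w_region=2.0):
    """Toy stand-in for an input-aware distillation loss: per-token error, with tokens in
    keypoint regions (e.g. lips, hands) up-weighted relative to the background."""
    err = (student_latents - teacher_latents).pow(2).mean(dim=-1)  # (B, N) per-token error
    weights = 1.0 + (w_region - 1.0) * token_mask.float()          # (N,), broadcast over batch
    return (err * weights).mean()
```

Restricting keys to keypoint regions shrinks the effective attention cost from quadratic in the token count toward the number of tokens times the (much smaller) relevant set, which is where the claimed efficiency gain would come from in this kind of design.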
Related papers
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching [26.603292632638283]
This paper introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor in place of the traditional training-free caching used for diffusion models. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality.
arXiv Detail & Related papers (2026-02-05T08:45:08Z) - LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation [35.01134463094784]
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. Existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap.
arXiv Detail & Related papers (2025-12-29T16:17:36Z) - StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - Taming Consistency Distillation for Accelerated Human Image Animation [47.63111489003292]
DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps. The code and models will be made publicly available.
arXiv Detail & Related papers (2025-04-15T12:44:53Z) - AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the inference steps for accelerating video diffusion models with a synthetic dataset. Our model achieves an 8.5x improvement in generation speed compared to the teacher model. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - VideoPure: Diffusion-based Adversarial Purification for Video Recognition [21.317424798634086]
We propose the first diffusion-based video purification framework to improve video recognition models' adversarial robustness: VideoPure. We employ temporal DDIM inversion to transform the input distribution into a temporally consistent and trajectory-defined distribution, covering adversarial noise while preserving more video structure. We investigate the defense performance of our method against black-box, gray-box, and adaptive attacks on benchmark datasets and models.
arXiv Detail & Related papers (2025-01-25T00:24:51Z) - Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis [27.43583075023949]
Ditto is a diffusion-based talking head framework that enables fine-grained controls and real-time inference. We show that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.
arXiv Detail & Related papers (2024-11-29T07:01:31Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)