InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
- URL: http://arxiv.org/abs/2508.14033v1
- Date: Tue, 19 Aug 2025 17:55:23 GMT
- Title: InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
- Authors: Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei
- Abstract summary: We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. Comprehensive evaluations on the HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
- Score: 66.48064661467781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
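The streaming design the abstract describes, chunked generation with temporal context frames carried across chunk boundaries, can be sketched in a few lines. This is a toy illustration of the control flow only: `generate_chunk`, the chunk length of 8 frames, and the default context length are hypothetical stand-ins, not the authors' actual model or API.

```python
from typing import Callable, List


def stream_dub(
    audio_chunks: List,
    first_frame,
    generate_chunk: Callable,
    context_len: int = 5,
) -> List:
    """Generate an arbitrarily long dubbed video chunk by chunk.

    The last `context_len` frames of each generated chunk are fed to the
    next call as temporal context, which is what makes inter-chunk
    transitions seamless in a streaming generator.
    """
    video: List = []
    context = [first_frame]  # bootstrap from a reference frame
    for audio in audio_chunks:
        chunk = generate_chunk(context, audio)
        video.extend(chunk)
        context = chunk[-context_len:]  # carry context into the next chunk
    return video


# Toy stand-in "model": each frame records its audio-chunk id and index,
# so chunk ordering is observable in the output.
def toy_generator(context, audio):
    return [(audio, i) for i in range(8)]


frames = stream_dub([0, 1, 2], ("ref", 0), toy_generator)
```

Because the loop only ever holds the current context in memory, the same pattern extends to arbitrarily many audio chunks, which is the point of a streaming generator for infinite-length dubbing.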
Related papers
- From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing [24.998261989251976]
We propose a self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data. A DiDubT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete input video frames to focus solely on precise, audio-driven lip modifications.
arXiv Detail & Related papers (2025-12-31T18:58:30Z)
- MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control [48.94486508604052]
MAGIC-Talk is a one-shot diffusion-based framework for customizable talking face generation. ReferenceNet preserves identity and enables fine-grained facial editing via text prompts. AnimateNet enhances motion coherence using structured motion priors.
arXiv Detail & Related papers (2025-10-26T19:49:31Z)
- InfinityHuman: Towards Long-Term Audio-Driven Human [37.55371306203722]
Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. We propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync.
arXiv Detail & Related papers (2025-08-27T18:36:30Z)
- MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation [21.216297567167036]
MirrorMe is a real-time, controllable framework built on the LTX video model. MirrorMe compresses video spatially and temporally for efficient latent-space denoising. Experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
arXiv Detail & Related papers (2025-06-27T09:57:23Z)
- SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers [25.36460340267922]
We present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs.
arXiv Detail & Related papers (2025-06-01T04:27:13Z)
- OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [13.623360048766603]
We present OmniSync, a universal lip-synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z)
- Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel AI-empowered storytelling framework to enable seamless video generation with natural action transitions and structured narratives. We first present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video. We then introduce a dynamics-informed prompt weighting mechanism that adaptively adjusts the influence of scene and action prompts at each diffusion timestep.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
- Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
arXiv Detail & Related papers (2025-01-15T18:59:15Z)
- PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [48.94486508604052]
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach for generating talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation [4.019144083959918]
We present TANGO, a framework for generating co-speech body-gesture videos.
Given a few-minute, single-speaker reference video, TANGO produces high-fidelity videos with synchronized body gestures.
arXiv Detail & Related papers (2024-10-05T16:30:46Z)
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the loss of global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
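The motion/appearance decoupling in the last entry can be illustrated with a toy warp-then-refine step. This is a generic pure-Python sketch, not the paper's networks: the nearest-neighbor warp by an integer flow field stands in for flow-based motion modeling, and the blend with a context frame stands in for context-aware refinement.

```python
def warp(prev, flow):
    """Nearest-neighbor warp of a 2-D frame by a per-pixel (dx, dy) flow
    field -- the 'motion' half of a motion/appearance-decoupled predictor."""
    h, w = len(prev), len(prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Pull the pixel from where the flow says it came from,
            # clamped to the frame border.
            sy = min(max(y - dy, 0), h - 1)
            sx = min(max(x - dx, 0), w - 1)
            out[y][x] = prev[sy][sx]
    return out


def refine(warped, context, alpha=0.9):
    """Blend the warped prediction with a global appearance context frame
    (a crude stand-in for context-aware refinement)."""
    return [
        [alpha * wv + (1 - alpha) * cv for wv, cv in zip(wrow, crow)]
        for wrow, crow in zip(warped, context)
    ]


prev = [[0.0, 1.0], [2.0, 3.0]]
flow = [[(1, 0)] * 2 for _ in range(2)]  # uniform shift right by one pixel
moved = warp(prev, flow)
pred = refine(moved, [[0.0, 0.0], [0.0, 0.0]], alpha=0.5)
```

Separating the two steps keeps the motion estimate responsible only for where pixels go, while the refinement stage restores appearance detail the warp cannot carry, which is the rationale behind decoupled prediction.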
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.