From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing
- URL: http://arxiv.org/abs/2512.25066v1
- Date: Wed, 31 Dec 2025 18:58:30 GMT
- Title: From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing
- Authors: Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu
- Abstract summary: We propose a self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete input video frames to focus solely on precise, audio-driven lip modifications.
- Score: 24.998261989251976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where incomplete visual conditioning forces models to simultaneously hallucinate missing content and synchronize lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with full identity cues, scene interactions, and continuous spatiotemporal dynamics. This rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness in challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy that disentangles conflicting editing objectives across diffusion timesteps, thereby stabilizing training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
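To make the two-stage recipe above concrete, here is a minimal PyTorch sketch of one plausible reading of it. This is an illustration, not the authors' implementation: `TinyDiT`, `make_pair`, `sync_proxy`, the linear-interpolation forward process, and the exact phase-weighting schedule are all assumptions introduced here for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiT(nn.Module):
    """Toy stand-in for a Diffusion Transformer over per-frame video latents."""
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, noisy, cond_video, audio):
        # A real DiT embeds the timestep and mixes conditions via attention;
        # a plain MLP over concatenated features keeps this sketch runnable.
        return self.backbone(torch.cat([noisy, cond_video, audio], dim=-1))

def phase_weights(t):
    """Assumed form of the timestep-adaptive multi-phase strategy: noisy
    steps (large t) emphasize audio-lip alignment, clean steps emphasize
    reconstruction fidelity."""
    return t, 1.0 - t  # (w_sync, w_recon)

@torch.no_grad()
def make_pair(generator, real_video, mismatched_audio):
    """Stage 1: a frozen, pretrained generator re-dubs the real clip with a
    *mismatched* audio track (a full sampling loop in practice, one call
    here), yielding a companion that differs only in lip motion."""
    companion = generator(torch.zeros_like(real_video), real_video, mismatched_audio)
    return companion, real_video  # (editor input, editor target)

def train_step(editor, opt, companion, target, audio, sync_proxy):
    """Stage 2: teach the editor to map (companion video, original audio)
    back to the original video, i.e. pure audio-driven lip editing."""
    t = torch.rand(())                    # diffusion timestep in [0, 1]
    noise = torch.randn_like(target)
    noisy = (1 - t) * target + t * noise  # simple interpolation forward process
    pred = editor(noisy, companion, audio)
    w_sync, w_recon = phase_weights(t)
    loss = w_recon * F.mse_loss(pred, target) + w_sync * sync_proxy(pred, audio)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    gen, editor = TinyDiT(), TinyDiT()
    opt = torch.optim.Adam(editor.parameters(), lr=1e-4)
    video, wrong_audio, right_audio = (torch.randn(8, 64) for _ in range(3))
    companion, target = make_pair(gen, video, wrong_audio)
    sync = lambda pred, aud: F.mse_loss(pred, aud)  # placeholder sync loss
    print(train_step(editor, opt, companion, target, right_audio, sync))
```

The design choice the abstract hinges on is visible in `make_pair`: because the companion clip is the same video re-dubbed with mismatched audio, the (companion, original) pair is frame-aligned everywhere except the mouth, so the editor receives complete visual context and can specialize in lip edits rather than hallucinating masked regions.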
Related papers
- EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers [3.3508228801277853]
We introduce EditYourself, a DiT-based framework for audio-driven video-to-video editing. It enables transcript-based modification of talking videos, including the seamless addition, removal, and retiming of visually spoken content. This represents a step toward generative video models as practical tools for professional video post-production.
arXiv Detail & Related papers (2026-01-29T18:49:27Z) - SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild [16.692450893925148]
SyncAnyone is a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. We develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video. We further tune the stage-2 model on this synthetic data, achieving precise lip editing and better background consistency.
arXiv Detail & Related papers (2025-12-25T16:49:40Z) - StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves references to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long-sequence dubbing. Comprehensive evaluations on the HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z) - SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers [25.36460340267922]
We present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs.
arXiv Detail & Related papers (2025-06-01T04:27:13Z) - Video Editing for Audio-Visual Dubbing [11.063156506583562]
EdiDub is a novel framework that reformulates visual dubbing as a content-aware editing task. It preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications.
arXiv Detail & Related papers (2025-05-29T12:56:09Z) - OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [18.187498205054748]
We present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z) - RASA: Replace Anyone, Say Anything -- A Training-Free Framework for Audio-Driven and Universal Portrait Video Editing [82.132107140504]
We introduce a training-free universal portrait video editing framework that provides a versatile and adaptable editing strategy. It supports portrait appearance editing conditioned on the changed first reference frame, as well as lip editing conditioned on varied speech. Our model can achieve more accurate and synchronized lip movements for the lip editing task, as well as more flexible motion transfer for the appearance editing task.
arXiv Detail & Related papers (2025-03-14T16:39:15Z) - ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)
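For contrast with the diffusion-based editors above, the non-parametric recipe of the last entry can be sketched in a few lines: rank candidate source frames by visual continuity under contrastively learned embeddings and by agreement with the driving audio, then stitch the winners into an output sequence. Everything below, including the blend weight `alpha` and the embedding inputs, is a hypothetical illustration rather than that paper's actual scoring.

```python
import numpy as np

def resample_texture(frame_emb, audio_emb, drive_audio, steps, alpha=0.5):
    """frame_emb:  (N, d) contrastively learned per-frame visual embeddings.
    audio_emb:   (N, d) embeddings of the source audio aligned to each frame.
    drive_audio: (steps, d) embeddings of the new conditioning audio."""
    order = [0]
    for s in range(1, steps):
        nxt = min(order[-1] + 1, len(frame_emb) - 1)
        visual = frame_emb @ frame_emb[nxt]  # look like the true successor
        audio = audio_emb @ drive_audio[s]   # match the driving audio now
        order.append(int(np.argmax(alpha * visual + (1 - alpha) * audio)))
    return order  # frame indices to stitch into the output video
```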