SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild
- URL: http://arxiv.org/abs/2512.21736v2
- Date: Tue, 30 Dec 2025 03:29:33 GMT
- Title: SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild
- Authors: Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang,
- Abstract summary: SyncAnyone is a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. We develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video. We further tune the stage 2 model on this synthetic data, achieving precise lip editing and better background consistency.
- Score: 16.692450893925148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.
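The Stage 2 data generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `mask_mouth_region`, `toy_stage1_model`, `build_pseudo_pairs`, and `PseudoPair` are hypothetical, the lower-half mask is a crude placeholder for a real mouth mask, and the pairing direction (synthesized video as input, original video and its own audio as target) is one plausible reading that the abstract does not spell out.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np

Video = np.ndarray  # assumed layout: (frames, height, width, channels)
Audio = np.ndarray  # assumed layout: (samples,)


@dataclass
class PseudoPair:
    """One pseudo-paired sample (hypothetical structure, not from the paper)."""
    model_input: Video   # Stage 1 output driven by audio from another clip
    target_audio: Audio  # the source clip's own audio
    target_video: Video  # the untouched source clip, used as the target


def mask_mouth_region(video: Video, top_fraction: float = 0.5) -> Video:
    """Zero out the lower part of each frame (placeholder for a mouth mask)."""
    masked = video.copy()
    start_row = int(video.shape[1] * top_fraction)
    masked[:, start_row:] = 0
    return masked


def toy_stage1_model(masked_video: Video, audio: Audio) -> Video:
    """Stand-in for the Stage 1 inpainting model (assumption for illustration).

    It fills the lower half of each frame with the mean audio value so the
    output visibly depends on the driving audio.
    """
    out = masked_video.copy()
    start_row = out.shape[1] // 2
    out[:, start_row:] = float(audio.mean())
    return out


def build_pseudo_pairs(
    videos: Sequence[Video],
    audios: Sequence[Audio],
    stage1_model: Callable[[Video, Audio], Video],
    rng: random.Random,
) -> List[PseudoPair]:
    """Synthesize a re-dubbed copy of each clip using randomly sampled audio."""
    if len(videos) != len(audios) or len(videos) < 2:
        raise ValueError("need at least two clips with matching audio tracks")
    pairs = []
    for idx, (video, audio) in enumerate(zip(videos, audios)):
        # Sample driving audio from a different clip than the source video.
        other = rng.choice([i for i in range(len(audios)) if i != idx])
        synthesized = stage1_model(mask_mouth_region(video), audios[other])
        pairs.append(PseudoPair(synthesized, audio, video))
    return pairs
```

Under these assumptions, a mask-free Stage 2 model would be tuned to map `model_input` and `target_audio` back to `target_video`, so it sees complete frames at training time and does not need a masked input.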
Related papers
- From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing [24.998261989251976]
We propose a self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data. A DiDubT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete input video frames to focus solely on precise, audio-driven lip modifications.
arXiv Detail & Related papers (2025-12-31T18:58:30Z)
- StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z)
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. We transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z)
- OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [18.187498205054748]
We present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z)
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics. In the first stage, we employ a clip-level training scheme to establish coherent global motion. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z)
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
- MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling [12.438835523353347]
Existing approaches face a trilemma: diffusion-based methods achieve high visual fidelity but suffer from prohibitive computational costs. We present MuseTalk, a novel two-stage training framework that resolves this trade-off through latent space optimization and data sampling strategy. MuseTalk establishes an effective audio-visual feature fusion framework in the latent space, delivering 30 FPS output at 256*256 resolution on an NVIDIA V100 GPU.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk is an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
- Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.