Related papers: Identity-Preserving Video Dubbing Using Motion Warping

URL: http://arxiv.org/abs/2501.04586v2
Date: Thu, 09 Jan 2025 15:27:58 GMT
Title: Identity-Preserving Video Dubbing Using Motion Warping
Authors: Runzhen Liu, Qinjie Lin, Yunfei Liu, Lijian Lin, Ye Zhu, Yu Li, Chuhua Xian, Fa-Ting Hong
Abstract summary: Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. We propose IPTalker, a framework for video dubbing that achieves seamless alignment between driving audio and reference identity. IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention.
Score: 26.10803670509977
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific features, largely because they do not effectively capture the nuanced interplay between audio cues and the visual attributes of the reference identity. As a result, the generated outputs frequently lack fidelity in reproducing the unique textural and structural details of the reference identity. To address these limitations, we propose IPTalker, a novel and robust framework for video dubbing that achieves seamless alignment between driving audio and reference identity while ensuring both lip-sync accuracy and high-fidelity identity preservation. At the core of IPTalker is a transformer-based alignment mechanism designed to dynamically capture and model the correspondence between audio features and reference images, thereby enabling precise, identity-aware audio-visual integration. Building on this alignment, a motion warping strategy further refines the results by spatially deforming reference images to match the target audio-driven configuration. A dedicated refinement process then mitigates occlusion artifacts and enhances the preservation of fine-grained textures, such as mouth details and skin features. Extensive qualitative and quantitative evaluations demonstrate that IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention, establishing a new state of the art for high-quality, identity-consistent video dubbing.
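Illustration: the abstract describes motion warping as spatially deforming reference images to match a target configuration. The sketch below shows one generic way such a deformation can be applied, namely backward warping with a dense flow field and bilinear sampling. It is not IPTalker's implementation: the function name `warp_image`, the flow layout `(H, W, 2)` holding `(dx, dy)` offsets, and the border clamping are all assumptions made for illustration. In a model of this kind the flow would presumably be predicted by a network rather than supplied by hand.

```python
import numpy as np


def warp_image(image, flow):
    """Backward-warp `image` (H, W) or (H, W, C) with a dense flow (H, W, 2).

    Hypothetical convention: flow[y, x] = (dx, dy), so output pixel (x, y)
    samples the source image at (x + dx, y + dy) with bilinear interpolation.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source sampling coordinates, clamped to the image border.
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sampling location.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets used as bilinear weights.
    wx = src_x - x0
    wy = src_y - y0
    if image.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]

    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy usage: shift a 4x4 ramp image one pixel to the left.
reference = np.arange(16, dtype=float).reshape(4, 4)
shift_left = np.zeros((4, 4, 2))
shift_left[..., 0] = 1.0  # every output pixel samples one pixel to its right
warped = warp_image(reference, shift_left)
```

A zero flow returns the reference unchanged, which makes the identity case easy to check. Occluded or newly revealed regions cannot be recovered by warping alone, which is consistent with the abstract's mention of a separate refinement step.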

Related papers

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model [37.45649985771698]
We propose sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. This novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework.
arXiv Detail & Related papers (2026-02-12T03:25:41Z)
Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework [56.30142869506262]
Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion. This mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. We propose a systematic evaluation methodology to analyze and quantify lip leakage.
arXiv Detail & Related papers (2025-11-05T17:11:53Z)
InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves references to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z)
Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers [25.36460340267922]
We present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs.
arXiv Detail & Related papers (2025-06-01T04:27:13Z)
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [13.623360048766603]
We present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z)
Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter [10.608872317957026]
The "lip averaging" phenomenon occurs when a model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. We propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences.
arXiv Detail & Related papers (2025-03-09T02:36:31Z)
PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [34.43272121705662]
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z)
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training. It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. In the second component, we design a lightweight facial identity alignment (FIA) module. This novel design allows us to generate fine details in real time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations. Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos. Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.
arXiv Detail & Related papers (2024-05-09T09:22:09Z)
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos. We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality. Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z)
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.