Related papers: Data standardization for robust lip sync

URL: http://arxiv.org/abs/2202.06198v3
Date: Mon, 9 Sep 2024 03:11:17 GMT
Title: Data standardization for robust lip sync
Authors: Chun Wang
Abstract summary: Existing lip sync methods fall short of being robust in the wild. One important cause could be distracting factors on the visual input side, making extracting lip motion information difficult. This paper proposes a data standardization pipeline to standardize the visual input for lip sync.
Score: 10.235718439446044
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Lip sync is a fundamental audio-visual task. However, existing lip sync methods fall short of being robust in the wild. One important cause could be distracting factors on the visual input side, making extracting lip motion information difficult. To address these issues, this paper proposes a data standardization pipeline to standardize the visual input for lip sync. Based on recent advances in 3D face reconstruction, we first create a model that can consistently disentangle lip motion information from the raw images. Then, standardized images are synthesized with disentangled lip motion information, with all other attributes related to distracting factors set to predefined values independent of the input, to reduce their effects. Using synthesized images, existing lip sync methods improve their data efficiency and robustness, and they achieve competitive performance for the active speaker detection task.
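The standardization idea in the abstract (keep the lip motion, reset everything else to input-independent values) can be illustrated with a minimal sketch. This is not the paper's implementation: the `FaceParams` fields, the `CANONICAL` values, and the `standardize` function are all hypothetical names chosen for illustration, and the sketch assumes lip motion lives in the expression coefficients of a 3D face reconstruction. The rendering step that turns the standardized parameters back into an image is omitted.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame output of a 3D face reconstruction model."""
    identity: Tuple[float, ...]    # face shape of the speaker
    expression: Tuple[float, ...]  # assumed here to carry the lip motion
    pose: Tuple[float, float, float]
    lighting: Tuple[float, ...]


# Predefined, input-independent values for the distracting attributes.
# The numbers are placeholders, not values from the paper.
CANONICAL = FaceParams(
    identity=(0.0, 0.0, 0.0, 0.0),
    expression=(0.0, 0.0, 0.0),
    pose=(0.0, 0.0, 0.0),
    lighting=(1.0, 0.0, 0.0),
)


def standardize(frame: FaceParams) -> FaceParams:
    """Keep only the expression (lip motion) and reset all other attributes."""
    return replace(CANONICAL, expression=frame.expression)


# Two speakers with different identity, pose, and lighting but the same
# expression coefficients map to the same standardized parameters.
speaker_a = FaceParams((0.3, 0.1, 0.0, 0.2), (0.5, 0.1, 0.0), (0.2, 0.0, 0.1), (0.8, 0.1, 0.1))
speaker_b = FaceParams((0.9, 0.4, 0.7, 0.1), (0.5, 0.1, 0.0), (0.0, 0.3, 0.0), (0.5, 0.5, 0.0))
print(standardize(speaker_a) == standardize(speaker_b))
```

In this toy setup, a downstream lip sync model would only ever see variation in `expression`, which is the sense in which the visual input is standardized.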

Related papers

SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild [16.692450893925148]
SyncAnyone is a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. We develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video. We further tune the stage 2 model on this synthetic data, achieving precise lip editing and better background consistency.
arXiv Detail & Related papers (2025-12-25T16:49:40Z)
Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework [56.30142869506262]
Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion. This mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. We propose a systematic evaluation methodology to analyze and quantify lip leakage.
arXiv Detail & Related papers (2025-11-05T17:11:53Z)
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [13.623360048766603]
We present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z)
KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution [32.124841838431166]
Lip synchronization presents significant new challenges such as expression leakage from the input video. We present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak.
arXiv Detail & Related papers (2025-05-01T12:56:17Z)
SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation. We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to the unique speaking styles of individuals. We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from style reference video. Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance [13.050998759819933]
"OpFlowTalker" is a novel approach that utilizes predicted optical flow changes from audio inputs rather than direct image predictions. It smooths image transitions and aligns changes with semantic content. We also developed an optical flow synchronization module that regulates both full-face and lip movements.
arXiv Detail & Related papers (2024-05-23T15:42:34Z)
Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z)
SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets. We propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance. We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality. We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input. Previous studies revealed the importance of lip-speech synchronization and visual quality. We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.