GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2505.01928v1
- Date: Sat, 03 May 2025 21:44:59 GMT
- Title: GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
- Authors: Anushka Agarwal, Muhammad Yusuf Hassan, Talha Chafekar,
- Abstract summary: GenSync is a novel framework for multi-identity lip-synced video synthesis. It learns a unified network that synthesizes lip-synced videos for multiple speakers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity, GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8x faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.
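The abstract describes the Disentanglement Module only at a high level. The sketch below is a minimal, hypothetical illustration of the general idea it suggests, not the paper's actual architecture: per-speaker identity embeddings are kept separate from audio features, and a single shared decoder fuses the two to drive per-Gaussian displacements. All names, dimensions, and the linear decoder are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
AUDIO_DIM, ID_DIM, N_GAUSSIANS = 64, 32, 1000
NUM_SPEAKERS = 4

# Per-speaker identity embeddings; in a trained model these would be learned
identity_table = rng.standard_normal((NUM_SPEAKERS, ID_DIM))

# One shared decoder maps [audio ; identity] -> xyz offsets for every Gaussian
W = rng.standard_normal((AUDIO_DIM + ID_DIM, N_GAUSSIANS * 3)) * 0.01

def predict_offsets(audio_feat, speaker_id):
    """Fuse an audio feature with a speaker embedding and decode
    per-Gaussian xyz offsets: one shared network, many identities."""
    fused = np.concatenate([audio_feat, identity_table[speaker_id]])
    return (fused @ W).reshape(N_GAUSSIANS, 3)

audio_feat = rng.standard_normal(AUDIO_DIM)
offsets_a = predict_offsets(audio_feat, speaker_id=0)
offsets_b = predict_offsets(audio_feat, speaker_id=1)
print(offsets_a.shape)  # prints (1000, 3)
```

The point of the factorization is that the same audio clip produces different deformations for different speakers, while only the small identity table grows with the number of subjects.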
Related papers
- SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting [25.523486023087916]
A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. We introduce SyncTalk++ to address the critical issue of synchronization, identified as the 'devil' in creating realistic talking heads. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second.
arXiv Detail & Related papers (2025-06-17T17:22:12Z)
- OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [13.623360048766603]
We present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z)
- UniSync: A Unified Framework for Audio-Visual Synchronization [7.120340851879775]
We present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs. UniSync outperforms existing methods on standard datasets.
arXiv Detail & Related papers (2025-03-20T17:16:03Z)
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining whether a person's gestures are correlated with their speech.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
- StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator [85.40502725367506]
We propose StyleSync, an effective framework that enables high-fidelity lip synchronization.
Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face.
Our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames.
arXiv Detail & Related papers (2023-05-09T13:38:13Z)
- StyleLipSync: Style-based Personalized Lip-sync Video Generation [2.9914612342004503]
StyleLipSync is a style-based personalized lip-sync video generative model.
Our model can generate accurate lip-sync videos even with the zero-shot setting.
arXiv Detail & Related papers (2023-04-30T16:38:42Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality and lip-syncing output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z)
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-08-23T12:49:01Z)
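Several of the entries above (Synchformer, UniSync, GestSync) score audio-visual synchronization by comparing per-frame audio and visual embeddings. As a toy illustration of that shared idea, and not any specific paper's method, the sketch below scores candidate temporal shifts by the mean cosine similarity between aligned embedding frames and picks the best one. The embeddings here are synthetic; real systems learn them with contrastive training.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_offset(audio_emb, video_emb, max_shift=5):
    """Score each candidate shift by mean cosine similarity between
    per-frame audio and video embeddings; return the best shift."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, v = norm(audio_emb), norm(video_emb)
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            sim = (a[s:] * v[:len(v) - s]).sum(axis=1)
        else:
            sim = (a[:len(a) + s] * v[-s:]).sum(axis=1)
        scores[s] = sim.mean()
    return max(scores, key=scores.get)

# Synthetic streams sharing one underlying signal, with the video
# track starting 3 frames later than the audio track
base = rng.standard_normal((40, 16))
audio = base[0:32]
video = base[3:35]
print(estimate_offset(audio, video))  # prints 3
```

Because matched frames have cosine similarity 1 while mismatched random 16-dimensional embeddings are nearly orthogonal, the correct shift stands out clearly; learned embeddings aim for the same separation on real footage.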
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.