GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2505.01928v1
- Date: Sat, 03 May 2025 21:44:59 GMT
- Title: GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
- Authors: Anushka Agarwal, Muhammad Yusuf Hassan, Talha Chafekar,
- Abstract summary: GenSync is a novel framework for multi-identity lip-synced video synthesis. It learns a unified network that synthesizes lip-synced videos for multiple speakers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity, GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8x faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.
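The abstract describes the Disentanglement Module only at a high level. The sketch below is a minimal, hypothetical illustration of the general idea it suggests, not the paper's actual architecture: per-speaker identity embeddings are kept separate from audio features, and a single shared decoder fuses the two to drive per-Gaussian displacements. All names, dimensions, and the linear decoder are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
AUDIO_DIM, ID_DIM, N_GAUSSIANS = 64, 32, 1000
NUM_SPEAKERS = 4

# Per-speaker identity embeddings; in a trained model these would be learned
identity_table = rng.standard_normal((NUM_SPEAKERS, ID_DIM))

# One shared decoder maps [audio ; identity] -> xyz offsets for every Gaussian
W = rng.standard_normal((AUDIO_DIM + ID_DIM, N_GAUSSIANS * 3)) * 0.01

def predict_offsets(audio_feat, speaker_id):
    """Fuse an audio feature with a speaker embedding and decode
    per-Gaussian xyz offsets: one shared network, many identities."""
    fused = np.concatenate([audio_feat, identity_table[speaker_id]])
    return (fused @ W).reshape(N_GAUSSIANS, 3)

audio_feat = rng.standard_normal(AUDIO_DIM)
offsets_a = predict_offsets(audio_feat, speaker_id=0)
offsets_b = predict_offsets(audio_feat, speaker_id=1)
print(offsets_a.shape)  # prints (1000, 3)
```

The point of the factorization is that the same audio clip produces different deformations for different speakers, while only the small identity table grows with the number of subjects.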
Related papers
- SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting [25.523486023087916]
A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. We introduce SyncTalk++ to address the critical issue of synchronization, identified as the 'devil' in creating realistic talking heads. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second.
arXiv Detail & Related papers (2025-06-17T17:22:12Z)
- OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers [13.623360048766603]
We present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks. We also establish the AIGCLipSync Benchmark, the first evaluation suite for lip sync in AI-generated videos.
arXiv Detail & Related papers (2025-05-27T17:20:38Z)
- UniSync: A Unified Framework for Audio-Visual Synchronization [7.120340851879775]
We present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs. UniSync outperforms existing methods on standard datasets.
arXiv Detail & Related papers (2025-03-20T17:16:03Z)
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining whether a person's gestures are correlated with their speech.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z)
- StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator [85.40502725367506]
We propose StyleSync, an effective framework that enables high-fidelity lip synchronization.
Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face.
Our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames.
arXiv Detail & Related papers (2023-05-09T13:38:13Z)
- StyleLipSync: Style-based Personalized Lip-sync Video Generation [2.9914612342004503]
StyleLipSync is a style-based personalized lip-sync video generative model.
Our model can generate accurate lip-sync videos even with the zero-shot setting.
arXiv Detail & Related papers (2023-04-30T16:38:42Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality and lip-syncing output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z)
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-08-23T12:49:01Z)
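Several of the entries above (Synchformer, UniSync, GestSync) score audio-visual synchronization by comparing per-frame audio and visual embeddings. As a toy illustration of that shared idea, and not any specific paper's method, the sketch below scores candidate temporal shifts by the mean cosine similarity between aligned embedding frames and picks the best one. The embeddings here are synthetic; real systems learn them with contrastive training.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_offset(audio_emb, video_emb, max_shift=5):
    """Score each candidate shift by mean cosine similarity between
    per-frame audio and video embeddings; return the best shift."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, v = norm(audio_emb), norm(video_emb)
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            sim = (a[s:] * v[:len(v) - s]).sum(axis=1)
        else:
            sim = (a[:len(a) + s] * v[-s:]).sum(axis=1)
        scores[s] = sim.mean()
    return max(scores, key=scores.get)

# Synthetic streams sharing one underlying signal, with the video
# track starting 3 frames later than the audio track
base = rng.standard_normal((40, 16))
audio = base[0:32]
video = base[3:35]
print(estimate_offset(audio, video))  # prints 3
```

Because matched frames have cosine similarity 1 while mismatched random 16-dimensional embeddings are nearly orthogonal, the correct shift stands out clearly; learned embeddings aim for the same separation on real footage.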
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.