StyleSync: High-Fidelity Generalized and Personalized Lip Sync in
Style-based Generator
- URL: http://arxiv.org/abs/2305.05445v1
- Date: Tue, 9 May 2023 13:38:13 GMT
- Title: StyleSync: High-Fidelity Generalized and Personalized Lip Sync in
Style-based Generator
- Authors: Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang,
Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong
Wang
- Abstract summary: We propose StyleSync, an effective framework that enables high-fidelity lip synchronization.
Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face.
- Our design also enables personalized lip-sync by introducing style-space and generator refinement on only a limited number of frames.
- Score: 85.40502725367506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in syncing lip movements with any audio waves,
current methods still struggle to balance generation quality and the model's
generalization ability. Previous studies either require long-term data for
training or produce a similar movement pattern on all subjects with low
quality. In this paper, we propose StyleSync, an effective framework that
enables high-fidelity lip synchronization. We identify that a style-based
generator is sufficient to achieve this property in both one-shot
and few-shot scenarios. Specifically, we design a mask-guided spatial
information encoding module that preserves the details of the given face. The
mouth shapes are accurately modified by audio through modulated convolutions.
Moreover, our design enables personalized lip-sync by introducing style-space
and generator refinement on only a limited number of frames. Thus, the identity and
talking style of a target person could be accurately preserved. Extensive
experiments demonstrate the effectiveness of our method in producing
high-fidelity results on a variety of scenes. Resources can be found at
https://hangz-nju-cuhk.github.io/projects/StyleSync.
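The abstract's claim that mouth shapes are "accurately modified by audio through modulated convolutions" points at StyleGAN2-style weight modulation driven by an audio code. The PyTorch sketch below is a minimal illustration under that assumption, not the authors' implementation; the module name `AudioModulatedConv2d`, the `to_style` projection, and the audio-embedding dimension are made up for the example.
```python
# Minimal sketch: an audio embedding is mapped to per-channel styles that
# scale the convolution weights before demodulation (StyleGAN2-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioModulatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, audio_dim, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.to_style = nn.Linear(audio_dim, in_ch)  # audio embedding -> per-channel scale
        self.eps = eps

    def forward(self, x, audio_emb):
        b, c, h, w = x.shape
        style = self.to_style(audio_emb).view(b, 1, c, 1, 1) + 1.0   # keep initial scale near 1
        weight = self.weight.unsqueeze(0) * style                    # modulate: (B, out, in, k, k)
        # Demodulate so activations keep roughly unit variance.
        demod = torch.rsqrt(weight.pow(2).sum(dim=[2, 3, 4], keepdim=True) + self.eps)
        weight = weight * demod
        # Grouped conv applies a different (modulated) filter bank per sample.
        weight = weight.view(-1, c, *self.weight.shape[2:])
        out = F.conv2d(x.view(1, -1, h, w), weight,
                       padding=self.weight.shape[-1] // 2, groups=b)
        return out.view(b, -1, h, w)

# Usage sketch: a face feature map modulated by a per-frame audio embedding.
feat = torch.randn(2, 64, 32, 32)   # spatial features of the masked face
audio = torch.randn(2, 256)         # e.g. a mel/HuBERT-derived embedding (assumed)
out = AudioModulatedConv2d(64, 128, 3, audio_dim=256)(feat, audio)
print(out.shape)  # torch.Size([2, 128, 32, 32])
```
Under the same assumption, the personalization described above would amount to keeping this architecture fixed and refining only the style-space codes and generator weights on the limited frames of the target person.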
Related papers
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports online generation of 256x256 faces at more than 30 FPS with negligible starting latency. A generic sketch of such a latent-space pipeline follows this entry.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
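For the MuseTalk entry above, here is a generic sketch of what lip sync in a VAE latent space could look like: encode a mouth-masked frame to a compact latent, predict the completed latent from audio, and decode back to pixels. This is an illustration only; `ToyVAE`, `ToyLipGenerator`, and the Whisper-style audio feature are assumptions, not MuseTalk's actual components.
```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Toy stand-in for a pretrained VAE that maps 256x256 images to a 32x32 latent."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.enc = nn.Conv2d(3, latent_ch, 8, stride=8)          # 256x256 -> 32x32 latent
        self.dec = nn.ConvTranspose2d(latent_ch, 3, 8, stride=8)  # 32x32 latent -> 256x256

    def encode(self, img): return self.enc(img)
    def decode(self, z):   return self.dec(z)

class ToyLipGenerator(nn.Module):
    """Predicts a full-face latent from a mouth-masked latent plus an audio embedding."""
    def __init__(self, latent_ch=4, audio_dim=384):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_ch)
        self.fuse = nn.Conv2d(2 * latent_ch, latent_ch, 3, padding=1)

    def forward(self, masked_latent, audio_emb):
        a = self.audio_proj(audio_emb)[:, :, None, None].expand_as(masked_latent)
        return self.fuse(torch.cat([masked_latent, a], dim=1))

vae, gen = ToyVAE(), ToyLipGenerator()
face = torch.randn(1, 3, 256, 256)                 # reference frame
masked = face.clone(); masked[:, :, 128:, :] = 0   # hide the lower half (mouth region)
audio = torch.randn(1, 384)                        # e.g. a Whisper-style audio feature (assumed)
out = vae.decode(gen(vae.encode(masked), audio))   # re-synthesized 256x256 face
print(out.shape)  # torch.Size([1, 3, 256, 256])
```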
- Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, owing to their unique speaking styles.
We develop a Transformer-based model that predicts lip motion corresponding to the input audio, augmented by style information aggregated through cross-attention layers from a style reference video (a minimal cross-attention sketch follows this entry).
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
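For the style-reference entry above, the sketch below shows one plausible form of the described cross-attention aggregation: audio-derived query tokens attend over per-frame features of a style reference clip. It is a hedged illustration; `StyleCrossAttention` and all dimensions are assumptions rather than the paper's architecture.
```python
import torch
import torch.nn as nn

class StyleCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, style_tokens):
        # audio_tokens: (B, T_audio, dim) queries; style_tokens: (B, T_ref, dim) keys/values
        style, _ = self.attn(audio_tokens, style_tokens, style_tokens)
        return self.norm(audio_tokens + style)   # audio features enriched with speaking style

audio_tokens = torch.randn(2, 50, 256)   # per-frame audio features (assumed)
style_tokens = torch.randn(2, 120, 256)  # features of the style reference clip (assumed)
fused = StyleCrossAttention()(audio_tokens, style_tokens)
print(fused.shape)  # torch.Size([2, 50, 256])
```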
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework (a generic sketch of the former follows this entry).
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
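For the RealTalk entry above, a generic audio-to-expression transformer might look like the following: a Transformer encoder maps windowed audio features plus an identity code to per-frame 3DMM-style expression coefficients. This is only a sketch under those assumptions; `AudioToExpression` and its dimensions are illustrative and do not reproduce the paper's model or its FIA module.
```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, audio_dim=80, id_dim=128, model_dim=256, n_exp=64):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim + id_dim, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(model_dim, n_exp)

    def forward(self, audio_feats, id_emb):
        # audio_feats: (B, T, audio_dim); id_emb: (B, id_dim) identity / personal-style code
        T = audio_feats.size(1)
        x = torch.cat([audio_feats, id_emb[:, None, :].expand(-1, T, -1)], dim=-1)
        return self.out_proj(self.encoder(self.in_proj(x)))  # (B, T, n_exp) expression coeffs

model = AudioToExpression()
coeffs = model(torch.randn(2, 100, 80), torch.randn(2, 128))
print(coeffs.shape)  # torch.Size([2, 100, 64])
```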
- SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space.
We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos.
Experimental results on the HDTF dataset demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency (an illustrative frame-embedding consistency check is sketched after this entry).
arXiv Detail & Related papers (2024-05-09T09:22:09Z)
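For the SwapTalk entry above, identity consistency over a time series can be illustrated with a simple stand-in metric: embed each generated frame with a pretrained face-recognition network and average the cosine similarity to the clip's mean identity embedding. This is not SwapTalk's proposed metric, only a generic example; the face embedder is assumed.
```python
import torch
import torch.nn.functional as F

def identity_consistency(frame_embeddings: torch.Tensor) -> float:
    """frame_embeddings: (T, D), one identity embedding per generated frame."""
    emb = F.normalize(frame_embeddings, dim=-1)
    ref = F.normalize(emb.mean(dim=0, keepdim=True), dim=-1)  # mean identity over the clip
    return (emb @ ref.t()).mean().item()                      # 1.0 = perfectly stable identity

# Usage with random stand-in embeddings (replace with an ArcFace-style embedder over frames):
print(identity_consistency(torch.randn(75, 512)))
```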
- StyleLipSync: Style-based Personalized Lip-sync Video Generation [2.9914612342004503]
StyleLipSync is a style-based personalized lip-sync video generative model.
Our model can generate accurate lip-sync videos even in the zero-shot setting.
arXiv Detail & Related papers (2023-04-30T16:38:42Z)
- DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions (a minimal conditional-NeRF sketch follows this entry).
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)
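For the DFA-NeRF entry above, the sketch below illustrates a NeRF MLP conditioned on two disentangled vectors, one for lip movement and one for personalized attributes, alongside the positionally encoded sample point. It is an assumption-laden toy, not the paper's renderer; `ConditionalNeRF` and all sizes are illustrative.
```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=6):
    # x: (N, 3) -> (N, 3 * 2 * n_freqs) sin/cos features at increasing frequencies
    freqs = 2.0 ** torch.arange(n_freqs, dtype=x.dtype, device=x.device) * torch.pi
    ang = x[..., None] * freqs                          # (N, 3, n_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

class ConditionalNeRF(nn.Module):
    def __init__(self, lip_dim=64, attr_dim=32, hidden=128, n_freqs=6):
        super().__init__()
        in_dim = 3 * 2 * n_freqs + lip_dim + attr_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                       # RGB + density per sample point
        )

    def forward(self, pts, lip_feat, attr_feat):
        # pts: (N, 3); lip_feat: (lip_dim,) and attr_feat: (attr_dim,) for one frame
        cond = torch.cat([lip_feat, attr_feat]).expand(pts.size(0), -1)
        return self.mlp(torch.cat([positional_encoding(pts), cond], dim=1))

out = ConditionalNeRF()(torch.rand(1024, 3), torch.randn(64), torch.randn(32))
print(out.shape)  # torch.Size([1024, 4])
```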
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.