StyleLipSync: Style-based Personalized Lip-sync Video Generation
- URL: http://arxiv.org/abs/2305.00521v2
- Date: Mon, 12 Feb 2024 07:17:38 GMT
- Title: StyleLipSync: Style-based Personalized Lip-sync Video Generation
- Authors: Taekyung Ki and Dongchan Min
- Abstract summary: StyleLipSync is a style-based personalized lip-sync video generative model.
Our model can generate accurate lip-sync videos even in the zero-shot setting.
- Score: 2.9914612342004503
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we present StyleLipSync, a style-based personalized
lip-sync video generative model that can generate identity-agnostic
lip-synchronizing video from arbitrary audio. To generate a video of arbitrary
identities, we leverage an expressive lip prior from the semantically rich
latent space of a pre-trained StyleGAN, where video consistency can also be
imposed with a linear transformation. In contrast to previous lip-sync
methods, we introduce pose-aware masking that dynamically locates the mask to
improve naturalness across frames, utilizing a 3D parametric mesh predictor
frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for
an arbitrary person by introducing a sync regularizer that preserves lip-sync
generalization while enhancing person-specific visual information. Extensive
experiments demonstrate that our model can generate accurate lip-sync videos
even in the zero-shot setting and can enhance the characteristics of an unseen
face using only a few seconds of target video through the proposed adaptation
method.
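The pose-aware masking idea can be pictured with a short sketch. This is a minimal illustration, not the authors' implementation: it assumes a hypothetical per-frame array of 2D mouth landmarks obtained by projecting the mouth vertices of a 3D parametric face mesh (e.g. a 3DMM fit), and simply builds a mouth-region mask that follows the head pose instead of using a fixed lower-half crop.

```python
import numpy as np

def pose_aware_mask(frame: np.ndarray, mouth_landmarks_2d: np.ndarray,
                    margin: float = 0.15) -> np.ndarray:
    """Binary mask around the projected mouth region of one frame.

    mouth_landmarks_2d: (N, 2) pixel coordinates of mouth vertices projected
    from a 3D parametric face mesh for this frame (a stand-in for the paper's
    mesh predictor).
    """
    h, w = frame.shape[:2]
    x_min, y_min = mouth_landmarks_2d.min(axis=0)
    x_max, y_max = mouth_landmarks_2d.max(axis=0)
    # Expand the box by a relative margin so jaw motion stays inside the mask.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    x0, y0 = max(int(x_min - pad_x), 0), max(int(y_min - pad_y), 0)
    x1, y1 = min(int(x_max + pad_x), w), min(int(y_max + pad_y), h)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0  # region to be re-synthesized from audio
    return mask

# Per-frame usage: the masked region moves with the head pose.
# for frame, lm in zip(video_frames, per_frame_mouth_landmarks):
#     m = pose_aware_mask(frame, lm)
#     masked_input = frame * (1.0 - m[..., None])
```

The few-shot adaptation with a sync regularizer can likewise be sketched as a fine-tuning objective. The scorer below is a hypothetical stand-in for a pre-trained SyncNet-style audio-visual sync network; the paper's actual regularizer and weighting may differ.

```python
import torch
import torch.nn.functional as F

def adaptation_loss(generated, target, audio_feat, sync_scorer,
                    lambda_sync: float = 0.1) -> torch.Tensor:
    """Reconstruct the target person while keeping the lips tied to the audio."""
    recon = F.l1_loss(generated, target)            # person-specific appearance
    sync_conf = sync_scorer(generated, audio_feat)  # higher = better lip-sync
    sync_reg = (1.0 - sync_conf).mean()             # preserve sync generalization
    return recon + lambda_sync * sync_reg
```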
Related papers
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports online generation of faces at 256x256 resolution at more than 30 FPS with negligible starting latency.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to their unique speaking styles.
We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by style information aggregated through cross-attention layers from a style reference video.
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator [85.40502725367506]
We propose StyleSync, an effective framework that enables high-fidelity lip synchronization.
Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face.
Our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames.
arXiv Detail & Related papers (2023-05-09T13:38:13Z)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality, lip-synced output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.