HyperLips: Hyper Control Lips with High Resolution Decoder for Talking
Face Generation
- URL: http://arxiv.org/abs/2310.05720v3
- Date: Sun, 15 Oct 2023 02:28:43 GMT
- Title: HyperLips: Hyper Control Lips with High Resolution Decoder for Talking
Face Generation
- Authors: Yaosen Chen, Yu Yao, Zhiqiang Li, Wei Wang, Yanru Zhang, Han Yang,
Xuming Wen
- Abstract summary: HyperLips is a two-stage framework consisting of a hypernetwork for controlling lips and a high-resolution decoder for rendering high-fidelity faces.
In the first stage, we construct a base face generation network that uses the hypernetwork, conditioned on audio, to control the latent code encoding the visual face information.
In the second stage, we obtain higher quality face videos through a high-resolution decoder.
- Score: 21.55822398346139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Talking face generation has a wide range of potential applications in the
field of virtual digital humans. However, rendering high-fidelity facial video
while ensuring lip synchronization is still a challenge for existing
audio-driven talking face generation approaches. To address this issue, we
propose HyperLips, a two-stage framework consisting of a hypernetwork for
controlling lips and a high-resolution decoder for rendering high-fidelity
faces. In the first stage, we construct a base face generation network in
which the hypernetwork, conditioned on audio, controls the latent code that
encodes the visual face information. First, FaceEncoder extracts features from
the visual face information taken from the video source containing the face
frames to obtain the latent code. Then, HyperConv, whose weighting parameters
are updated by HyperNet with the audio features as input, modifies the latent
code to synchronize the lip movements with the audio. Finally,
FaceDecoder decodes the modified and synchronized latent code into visual face
content. In the second stage, we obtain higher-quality face videos through a
high-resolution decoder. To further improve the quality of face generation, we
trained a high-resolution decoder, HRDecoder, using face images and detected
sketches generated in the first stage as input. Extensive quantitative and
qualitative experiments show that our method outperforms state-of-the-art
approaches, producing more realistic, higher-fidelity, and better
lip-synchronized results. Project page:
https://semchan.github.io/HyperLips Project/
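
As a rough illustration of the two-stage pipeline described in the abstract, the PyTorch sketch below shows how a hypernetwork could map audio features to the weights of a HyperConv layer that modifies the visual latent code before decoding, and how a second-stage HRDecoder could refine the result from the generated face image plus a detected sketch. Apart from the names FaceEncoder, FaceDecoder, HyperNet, HyperConv, and HRDecoder taken from the abstract, all layer sizes, tensor shapes, and helper functions here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the HyperLips two-stage idea (assumed shapes and layers).
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_CH, AUDIO_DIM = 256, 128   # assumed channel / feature sizes


class HyperNet(nn.Module):
    """Maps audio features to the weights of a 1x1 HyperConv layer (assumed form)."""
    def __init__(self, audio_dim=AUDIO_DIM, ch=LATENT_CH):
        super().__init__()
        self.ch = ch
        # Predict a (ch x ch) kernel plus a bias for a 1x1 convolution.
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, ch * ch + ch),
        )

    def forward(self, audio_feat):                    # (B, audio_dim)
        params = self.mlp(audio_feat)
        w = params[:, : self.ch * self.ch].view(-1, self.ch, self.ch, 1, 1)
        b = params[:, self.ch * self.ch:]
        return w, b


def hyper_conv(latent, w, b):
    """Apply the audio-conditioned 1x1 conv to each sample's latent code."""
    out = []
    for i in range(latent.size(0)):                   # per-sample predicted weights
        out.append(F.conv2d(latent[i:i + 1], w[i], b[i]))
    return torch.cat(out, dim=0)


class BaseFaceNet(nn.Module):
    """Stage 1: FaceEncoder -> HyperConv (weights from HyperNet) -> FaceDecoder."""
    def __init__(self):
        super().__init__()
        self.face_encoder = nn.Sequential(            # placeholder encoder
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, LATENT_CH, 3, 2, 1), nn.ReLU())
        self.hyper_net = HyperNet()
        self.face_decoder = nn.Sequential(            # placeholder decoder
            nn.ConvTranspose2d(LATENT_CH, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, face_frames, audio_feat):
        latent = self.face_encoder(face_frames)       # visual latent code
        w, b = self.hyper_net(audio_feat)             # HyperConv weights from audio
        latent = hyper_conv(latent, w, b)             # lip-synced latent code
        return self.face_decoder(latent)              # stage-1 face image


class HRDecoder(nn.Module):
    """Stage 2: refine the stage-1 face using the face image plus a detected sketch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                     # placeholder refinement net
            nn.Conv2d(3 + 1, 64, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, base_face, sketch):
        return self.net(torch.cat([base_face, sketch], dim=1))
```

Under these assumptions, a forward pass would feed face frames and the corresponding audio features to BaseFaceNet, then pass its output together with a detected sketch to HRDecoder for the high-resolution stage.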
Related papers
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports online generation of faces at 256x256 resolution at more than 30 FPS with negligible starting latency.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method to control face expression deformation based on driven audio.
Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
- SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space [13.59798532129008]
We propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space.
We introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos.
Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency.
arXiv Detail & Related papers (2024-05-09T09:22:09Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- Audio-Visual Face Reenactment [34.79242760137663]
This work proposes a novel method to generate realistic talking head videos using audio and visual streams.
We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints.
We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region.
arXiv Detail & Related papers (2022-10-06T08:48:10Z)
- DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance field to pursue high-fidelity talking head generation.
Specifically, neural radiance field takes lip movements features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.