FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
- URL: http://arxiv.org/abs/2403.06375v3
- Date: Tue, 23 Apr 2024 03:35:53 GMT
- Title: FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
- Authors: Shuai Tan, Bin Ji, Ye Pan
- Abstract summary: This paper proposes using normalizing Flow and Vector-Quantization modeling to produce emotional talking faces.
Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space.
Our designed vector-quantization image generator treats the creation of expressive facial images as a code query task.
- Score: 4.429892245774265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar, we draw upon two critical insights from a human perspective: 1) The connection between audio and the non-deterministic facial dynamics, encompassing expressions, blinks, poses, should exhibit synchronous and one-to-many mapping. 2) Vibrant expressions are often accompanied by emotion-aware high-definition (HD) textures and finely detailed teeth. However, both aspects are frequently overlooked by existing methods. To this end, this paper proposes using normalizing Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently (FlowVQTalker). Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. The generation process commences with random sampling from the modeled distribution, guided by the accompanying audio, enabling both lip-synchronization and the uncertain nonverbal facial cues generation. Furthermore, our designed vector-quantization image generator treats the creation of expressive facial images as a code query task, utilizing a learned codebook to provide rich, high-quality textures that enhance the emotional perception of the results. Extensive experiments are conducted to showcase the effectiveness of our approach.
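The abstract outlines a two-stage pipeline: a flow-based coefficient generator that samples facial dynamics from an emotion-class mixture prior under audio guidance, followed by a vector-quantization image generator that renders expressive textures by querying a learned codebook. The sketch below only illustrates the general shape of these two ideas; the module sizes, the single coupling layer, the per-emotion Gaussian prior, and the stand-in audio/texture features are all assumptions made for brevity, not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's released code. Dimensions, the
# single coupling step, the per-emotion Gaussian prior, and the stand-in
# audio/texture features are assumptions for brevity.
import torch
import torch.nn as nn

class AudioConditionedCoupling(nn.Module):
    """One invertible affine-coupling step conditioned on an audio feature."""
    def __init__(self, dim=64, audio_dim=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + audio_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * (dim - self.half)),
        )

    def inverse(self, z, audio):
        # latent -> coefficient direction, used at generation time
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, audio], dim=-1)).chunk(2, dim=-1)
        return torch.cat([z1, (z2 - shift) * torch.exp(-scale)], dim=-1)

class EmotionMixturePrior(nn.Module):
    """One Gaussian component per emotion class; sampling picks a component."""
    def __init__(self, n_emotions=8, dim=64):
        super().__init__()
        self.means = nn.Parameter(torch.randn(n_emotions, dim))
        self.log_stds = nn.Parameter(torch.zeros(n_emotions, dim))

    def sample(self, emotion_id, batch=1):
        eps = torch.randn(batch, self.means.shape[1])
        return self.means[emotion_id] + eps * self.log_stds[emotion_id].exp()

def vq_lookup(features, codebook):
    """Nearest-neighbour codebook query, as in a standard VQ decoder."""
    idx = torch.cdist(features, codebook).argmin(dim=-1)
    return codebook[idx], idx

# Stage 1: sample diverse, emotion-consistent motion coefficients for a frame.
prior, flow = EmotionMixturePrior(), AudioConditionedCoupling()
audio_feat = torch.randn(1, 128)        # stand-in for an audio encoder output
z = prior.sample(emotion_id=3)          # random draw from one emotion component
coeffs = flow.inverse(z, audio_feat)    # 3DMM-style expression/pose coefficients

# Stage 2: render expressive texture by querying a learned codebook.
codebook = torch.randn(512, 64)         # stand-in for the learned texture codebook
img_feat = torch.randn(16, 64)          # stand-in for encoder features of a face image
tex_codes, code_ids = vq_lookup(img_feat, codebook)
```

Repeated draws of z under the same audio condition illustrate the one-to-many mapping the abstract motivates: the audio guides the generation while the sampled latent varies the nonverbal cues such as blinks and poses.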
Related papers
- Towards Localized Fine-Grained Control for Facial Expression Generation [54.82883891478555]
Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent.
Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity.
We propose the use of AUs (action units) for facial expression control in face generation.
arXiv Detail & Related papers (2024-07-25T18:29:48Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face rendering framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation [12.044308738509402]
We propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables.
This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning.
Our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.
arXiv Detail & Related papers (2024-06-12T06:00:00Z)
- CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation [13.27632316528572]
Speech-driven 3D facial animation technology has been developed for years, but its practical applications still fall short of expectations.
Main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions.
This paper proposes a method called CSTalk that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions.
arXiv Detail & Related papers (2024-04-29T11:19:15Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- GMTalker: Gaussian Mixture-based Audio-Driven Emotional talking video Portraits [37.12506653015298]
We present GMTalker, a Gaussian mixture-based emotional talking portraits generation framework.
Specifically, we propose a continuous and disentangled latent space, achieving more flexible emotion manipulation.
We also introduce a normalizing flow-based motion generator pretrained on a large dataset to generate diverse head poses, blinks, and eyeball movements.
arXiv Detail & Related papers (2023-12-12T19:03:04Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning [43.09015109281053]
We propose a more flexible and generalized framework for talking face generation.
Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space.
An Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation.
arXiv Detail & Related papers (2023-05-04T05:59:34Z)
- Emotionally Enhanced Talking Face Generation [52.07451348895041]
We build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions.
We show that our model can adapt to arbitrary identities, emotions, and languages.
Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions.
arXiv Detail & Related papers (2023-03-21T02:33:27Z)
- Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio.
Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces.
With the disentangled features, dynamic 2D emotional facial landmarks can be deduced.
Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z)