VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style
Transfer
- URL: http://arxiv.org/abs/2308.04830v2
- Date: Fri, 11 Aug 2023 05:56:35 GMT
- Title: VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style
Transfer
- Authors: Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan,
Sheng Zhao
- Abstract summary: This paper proposes an unsupervised variational style transfer model (VAST) to vivify the neutral photo-realistic avatars.
The model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder that models accurate speech-related movements; and a variational style enhancer that makes the style space highly expressive and meaningful.
With these designs for facial style learning, the model can flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner.
- Score: 38.294607144065566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current talking face generation methods mainly focus on speech-lip
synchronization. However, insufficient investigation on the facial talking
style leads to a lifeless and monotonous avatar. Most previous works fail to
imitate expressive styles from arbitrary video prompts and ensure the
authenticity of the generated video. This paper proposes an unsupervised
variational style transfer model (VAST) to vivify the neutral photo-realistic
avatars. Our model consists of three key components: a style encoder that
extracts facial style representations from the given video prompts; a hybrid
facial expression decoder to model accurate speech-related movements; a
variational style enhancer that enhances the style space to be highly
expressive and meaningful. With our essential designs on facial style learning,
our model is able to flexibly capture the expressive facial style from
arbitrary video prompts and transfer it onto a personalized image renderer in a
zero-shot manner. Experimental results demonstrate the proposed approach
contributes to a more vivid talking avatar with higher authenticity and richer
expressiveness.
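The data flow described in the abstract (style encoder, variational style enhancer, expression decoder) can be illustrated with a toy sketch. This is not the authors' implementation: the mean-pooling encoder, the VAE-style reparameterization used for the enhancer, the additive decoder, and every function and parameter name below are assumptions chosen only to show how a style vector taken from a video prompt might condition speech-driven expression features.

```python
import math
import random


def style_encoder(prompt_frames):
    """Mean-pool per-frame expression features into one style vector.

    Assumption: mean pooling stands in for the learned style encoder.
    """
    dim = len(prompt_frames[0])
    n = len(prompt_frames)
    return [sum(frame[i] for frame in prompt_frames) / n for i in range(dim)]


def variational_enhancer(style_mean, log_var, rng):
    """Sample a style code with the reparameterization trick.

    z = mu + sigma * eps, eps ~ N(0, 1). Assumption: a generic VAE-style
    step standing in for the paper's variational style enhancer.
    """
    return [
        mu + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for mu, lv in zip(style_mean, log_var)
    ]


def expression_decoder(audio_frames, style_code):
    """Toy decoder: shift every per-frame audio feature by the style code."""
    return [[a + s for a, s in zip(frame, style_code)] for frame in audio_frames]


def transfer_style(audio_frames, prompt_frames, log_var, seed=0):
    """Hypothetical end-to-end pass: prompt -> style -> styled expressions."""
    rng = random.Random(seed)
    style_mean = style_encoder(prompt_frames)
    style_code = variational_enhancer(style_mean, log_var, rng)
    return expression_decoder(audio_frames, style_code)
```

Because the style vector is computed from the prompt at inference time, no per-style retraining is needed in this toy setup, which mirrors the zero-shot usage the abstract describes.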
Related papers
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z)
- Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility.
We present a speech-driven expressive 3D facial animation synthesis framework that models identity-specific facial motion as latent representations (called styles).
Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z)
- AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z)
- ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment [5.516575655881858]
We introduce a technique that enables the control of arbitrary styles by leveraging natural language as emotion prompts.
Our method accomplishes expressive facial animation generation and offers enhanced flexibility in effectively conveying the desired style.
arXiv Detail & Related papers (2023-08-28T09:35:13Z)
- Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
- Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.