ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment
- URL: http://arxiv.org/abs/2308.14448v2
- Date: Mon, 11 Sep 2023 08:56:32 GMT
- Title: ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment
- Authors: Yicheng Zhong, Huawei Wei, Peiji Yang, Zhisheng Wang
- Abstract summary: We introduce a technique that enables the control of arbitrary styles by leveraging natural language as emotion prompts.
Our method accomplishes expressive facial animation generation and offers enhanced flexibility in effectively conveying the desired style.
- Score: 5.516575655881858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of stylized speech-driven facial animation is to create
animations that encapsulate specific emotional expressions. Existing methods
often depend on pre-established emotional labels or facial expression
templates, which may limit the necessary flexibility for accurately conveying
user intent. In this research, we introduce a technique that enables the
control of arbitrary styles by leveraging natural language as emotion prompts.
This technique presents benefits in terms of both flexibility and
user-friendliness. To realize this objective, we initially construct a
Text-Expression Alignment Dataset (TEAD), wherein each facial expression is
paired with several prompt-like descriptions. We propose an innovative automatic
annotation method, supported by Large Language Models (LLMs), to expedite the
dataset construction, thereby eliminating the substantial expense of manual
annotation. Following this, we utilize TEAD to train a CLIP-based model, termed
ExpCLIP, which encodes text and facial expressions into semantically aligned
style embeddings. The embeddings are subsequently integrated into the facial
animation generator to yield expressive and controllable facial animations.
Given the limited diversity of facial emotions in existing speech-driven facial
animation training data, we further introduce an effective Expression Prompt
Augmentation (EPA) mechanism to enable the animation generator to support
unprecedented richness in style control. Comprehensive experiments illustrate
that our method accomplishes expressive facial animation generation and offers
enhanced flexibility in effectively conveying the desired style.
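The abstract's core idea is a CLIP-style contrastive objective that maps matched text prompts and facial expressions to nearby points in a shared style-embedding space. Below is a minimal sketch of that idea only, not the authors' released code: the encoder architecture, the 52-dimensional blendshape input, the embedding size, and the temperature are assumptions made for illustration.

```python
# Hedged sketch of CLIP-style text/expression alignment (not ExpCLIP's actual code).
# Assumptions: expressions are 52-D blendshape vectors, text embeddings come from
# some text encoder (random stand-ins here), and a symmetric InfoNCE-style loss
# aligns the two modalities, as the abstract describes at a high level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionEncoder(nn.Module):
    """Maps a blendshape vector into the shared style-embedding space."""
    def __init__(self, blendshape_dim: int = 52, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(blendshape_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def clip_alignment_loss(text_emb: torch.Tensor,
                        expr_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matched (text, expression) pairs attract,
    mismatched pairs within the batch repel, as in CLIP."""
    logits = text_emb @ expr_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(text_emb), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: in the paper, text embeddings would come from prompt descriptions in
# TEAD rather than random vectors, and the aligned style embedding would then
# condition the speech-driven animation generator.
batch = 8
text_emb = F.normalize(torch.randn(batch, 512), dim=-1)
expr_emb = ExpressionEncoder()(torch.rand(batch, 52))
loss = clip_alignment_loss(text_emb, expr_emb)
```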
Related papers
- Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation [66.53435569574135]
Existing facial expression recognition methods typically fine-tune a pre-trained visual encoder using discrete labels.
We observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations.
We propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation.
arXiv Detail & Related papers (2024-09-13T07:28:57Z) - EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation.
It enables fine-grained control of emotion categories and intensities.
It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z) - DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation [14.07086606183356]
Speech-driven 3D facial animation has garnered significant attention thanks to its broad range of applications.
Current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion.
We introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs.
arXiv Detail & Related papers (2024-08-12T08:56:49Z) - Towards Localized Fine-Grained Control for Facial Expression Generation [54.82883891478555]
Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent.
Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity.
We propose the use of AUs (action units) for facial expression control in face generation.
arXiv Detail & Related papers (2024-07-25T18:29:48Z) - CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation [13.27632316528572]
Speech-driven 3D facial animation technology has been developed for years, but its practical applications still fall short of expectations.
Main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions.
This paper proposes a method called CSTalk that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions.
arXiv Detail & Related papers (2024-04-29T11:19:15Z) - Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z) - Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility.
We present a speech-driven expressive 3D facial animation synthesis framework that models identity-specific facial motion as latent representations (called styles).
Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z) - AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z) - GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expression.
We achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate applicability to various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z) - Expressive Speech-driven Facial Animation with controllable emotions [12.201573788014622]
This paper presents a novel deep learning-based approach for expressive facial animation generation from speech.
It can exhibit wide-spectrum facial expressions with controllable emotion type and intensity.
It enables emotion-controllable facial animation, where the target expression can be continuously adjusted.
arXiv Detail & Related papers (2023-01-05T11:17:19Z) - Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)