Related papers: Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation

Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation

URL: http://arxiv.org/abs/2312.10877v1
Date: Mon, 18 Dec 2023 01:49:42 GMT
Title: Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation
Authors: Hui Fu, Zeqing Wang, Ke Gong, Keze Wang, Tianshui Chen, Haojie Li, Haifeng Zeng, Wenxiong Kang
Abstract summary: Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding. We also propose a novel framework called textbfMimic to learn disentangled representations of the speaking style and content from facial motions.
Score: 41.489700112318864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called \textbf{Mimic} to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/

Related papers

Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation [8.75374562753977]
Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio.<n>Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth.<n>We propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions.
arXiv Detail & Related papers (2025-07-28T07:04:50Z)
MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization [12.143710013809322]
Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker's speaking style.<n>Previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference.<n>We propose MemoryTalker which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style only with audio input.
arXiv Detail & Related papers (2025-07-28T06:47:59Z)
Model See Model Do: Speech-Driven Facial Animation with Style Control [14.506128477193991]
Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation.<n>Existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions.<n>We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip.
arXiv Detail & Related papers (2025-05-02T14:47:21Z)
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model [41.35209566957009]
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. We introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks.
arXiv Detail & Related papers (2025-02-27T17:49:01Z)
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes [74.82911268630463]
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. MimicTalk exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG. Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness.
arXiv Detail & Related papers (2024-10-09T10:12:37Z)
3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing [22.30870274645442]
We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing. Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
arXiv Detail & Related papers (2023-12-01T19:01:05Z)
Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve high-degree of naturalness and plausibility. We present a speech-driven expressive 3D facial animation synthesis framework that models identity specific facial motion as latent representations (called as styles) Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z)
AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach. It learns the personalized talking style from a reference video of about 10 seconds. It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z)
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis. It captures the complex one-to-many relationships between speech and 3D face based on diffusion. It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method. It is very robust to background noise and can handle audio recorded in a variety of situations. It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z)
Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor. We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video. We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation. Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio. We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z)
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. Our approach ensures highly accurate lip motion, while also plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eye brow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.