DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models
- URL: http://arxiv.org/abs/2310.00434v2
- Date: Tue, 14 May 2024 13:12:39 GMT
- Title: DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models
- Authors: Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-Jin Liu
- Abstract summary: We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style modeling also covers head pose generation, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
- Score: 24.401443462720135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io.
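The conditioning scheme described in the abstract can be illustrated with a small sketch: a diffusion denoiser over windows of 3DMM parameters takes per-frame speech features and a style embedding extracted from a reference clip, and at sampling time classifier-free guidance combines conditional and unconditional predictions. The module names, tensor shapes, the use of noise prediction, and the two separate guidance weights below are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of classifier-free guidance for a diffusion-based motion
# generator conditioned on speech and style. Shapes, module names, and the
# guidance form are illustrative assumptions, not DiffPoseTalk's exact code.
import torch
import torch.nn as nn


class MotionDenoiser(nn.Module):
    """Hypothetical denoiser: predicts noise for a window of 3DMM parameters."""

    def __init__(self, motion_dim=64, audio_dim=768, style_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + style_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, audio_feat, style_emb):
        # Broadcast the timestep and concatenate all conditions per frame.
        t_embed = t.float().view(-1, 1, 1).expand(x_t.shape[0], x_t.shape[1], 1)
        h = torch.cat([x_t, audio_feat, style_emb, t_embed], dim=-1)
        return self.net(h)


@torch.no_grad()
def guided_noise(model, x_t, t, audio_feat, style_emb, w_audio=2.0, w_style=2.0):
    """Classifier-free guidance with separate weights for speech and style.

    eps = eps_uncond + w_audio * (eps_audio - eps_uncond)
                     + w_style * (eps_full  - eps_audio)
    Dropping a condition is emulated by zeroing it (a common null-token stand-in).
    """
    null_audio = torch.zeros_like(audio_feat)
    null_style = torch.zeros_like(style_emb)
    eps_uncond = model(x_t, t, null_audio, null_style)
    eps_audio = model(x_t, t, audio_feat, null_style)
    eps_full = model(x_t, t, audio_feat, style_emb)
    return eps_uncond + w_audio * (eps_audio - eps_uncond) + w_style * (eps_full - eps_audio)


if __name__ == "__main__":
    B, T = 2, 100  # batch of 2 clips, 100 frames each
    model = MotionDenoiser()
    x_t = torch.randn(B, T, 64)            # noisy 3DMM expression/pose parameters
    t = torch.randint(0, 1000, (B,))       # diffusion timesteps
    audio = torch.randn(B, T, 768)         # per-frame speech features
    style = torch.randn(B, 1, 128).expand(B, T, 128)  # style embedding from a reference clip
    print(guided_noise(model, x_t, t, audio, style).shape)  # torch.Size([2, 100, 64])
```

As with classifier-free guidance in general, larger guidance weights are expected to strengthen adherence to the speech and style conditions at some cost to sample diversity; the specific weighting used by the paper is not reproduced here.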
Related papers
- MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes [74.82911268630463]
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos.
MimicTalk exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG.
Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness.
arXiv Detail & Related papers (2024-10-09T10:12:37Z) - StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads [46.749597670092484]
Existing one-shot talking head methods fail to produce diverse speaking styles in the final videos.
We propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference videos.
Our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
arXiv Detail & Related papers (2024-09-14T03:49:38Z) - Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features [22.31865247379668]
Speech-driven 3D facial animation has seen substantial progress recently.
Most related works utilize only the acoustic modality and neglect the influence of visual and textual cues.
We present a novel framework, namely PMMTalk, which uses complementary pseudo multi-modal features to improve the accuracy of facial animation.
arXiv Detail & Related papers (2023-12-05T14:12:38Z) - AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z) - FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion [0.0]
We present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations.
Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input.
We also introduce a new in-house dataset based on a blendshape-based rigged character.
arXiv Detail & Related papers (2023-09-20T13:33:00Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis [17.650661515807993]
We propose to inject style into the talking face synthesis framework by imitating the arbitrary talking style of a given reference video.
We devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
arXiv Detail & Related papers (2021-10-30T08:15:27Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.