Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices
- URL: http://arxiv.org/abs/2104.10299v1
- Date: Wed, 21 Apr 2021 01:14:50 GMT
- Title: Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices
- Authors: Cho-Ying Wu, Ke Xu, Chin-Cheng Hsu, Ulrich Neumann
- Abstract summary: This work analyzes whether 3D face models can be learned from only the speech input of speakers.
We propose both supervised and unsupervised learning frameworks. In particular, we demonstrate how unsupervised learning is possible in the absence of a direct voice-to-3D-face dataset.
- Score: 18.600534152951926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work analyzes whether 3D face models can be learned from only the
speech input of speakers. Previous work on cross-modal face synthesis studies
image generation from voices. However, image synthesis involves variations such
as hairstyles, backgrounds, and facial textures that are arguably irrelevant to
voice, or whose correlation with voice has not been directly studied. We
instead investigate the ability to reconstruct 3D faces, concentrating on
geometry only, which is more physiologically grounded. We propose both
supervised and unsupervised learning frameworks. In particular, we demonstrate
how unsupervised learning is possible in the absence of a direct
voice-to-3D-face dataset, under limited availability of 3D face scans, when the
model is equipped with knowledge distillation. To evaluate performance, we also
propose several metrics that measure the geometric fitness of two 3D faces
based on points, lines, and regions. Experimental results suggest that 3D face
shapes can indeed be reconstructed from voices and that our method improves
performance over the baseline. The best performance gains (15% - 20%) on the
ear-to-ear distance ratio metric (ER) coincide with the intuition that one can
roughly tell whether a speaker's face is overall wider or thinner from the
voice alone. See our project page for code and data.
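The abstract only names the distillation setup, so the following is a minimal sketch of how such cross-modal knowledge distillation could look: a frozen image-to-3D-face teacher produces pseudo ground-truth meshes from speakers' photos, and a voice-to-3D-face student regresses toward them, so no direct voice-to-3D-face pairs or 3D scans are needed. The module names, embedding size, vertex count, and L1 loss below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_VERTICES = 5023    # assumed mesh topology size (e.g., a 3DMM template)
VOICE_EMB_DIM = 256  # assumed speaker-embedding dimension


class VoiceToMesh(nn.Module):
    """Student: regresses 3D face vertices from a voice embedding."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(VOICE_EMB_DIM, 512),
            nn.ReLU(),
            nn.Linear(512, N_VERTICES * 3),
        )

    def forward(self, voice_emb: torch.Tensor) -> torch.Tensor:
        return self.net(voice_emb).view(-1, N_VERTICES, 3)


def distillation_step(student: VoiceToMesh,
                      voice_emb: torch.Tensor,
                      teacher_mesh: torch.Tensor) -> torch.Tensor:
    """One training step: match the frozen teacher's mesh prediction.

    teacher_mesh is the output of a pretrained image-to-3D-face network
    run on a photo of the same speaker; it acts as a pseudo label, so no
    3D scan of that speaker is required.
    """
    pred_mesh = student(voice_emb)
    return F.l1_loss(pred_mesh, teacher_mesh)
```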
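Likewise, here is a rough sketch of an ear-to-ear distance ratio (ER) comparison between a predicted and a ground-truth mesh. The paper proposes point-, line-, and region-based metrics but its exact ER formula is not reproduced here, so the landmark indices and the |1 - ratio| error convention are assumptions for illustration.

```python
import numpy as np

LEFT_EAR_IDX = 0    # hypothetical vertex index of the left ear landmark
RIGHT_EAR_IDX = 1   # hypothetical vertex index of the right ear landmark


def ear_to_ear_distance(vertices: np.ndarray) -> float:
    """Euclidean distance between the two ear landmarks of a face mesh."""
    return float(np.linalg.norm(vertices[LEFT_EAR_IDX] - vertices[RIGHT_EAR_IDX]))


def er_error(pred_vertices: np.ndarray, gt_vertices: np.ndarray) -> float:
    """Deviation of the predicted face width from the ground truth.

    A ratio of 1.0 means the predicted ear-to-ear width matches the
    ground-truth mesh exactly; the error is the deviation from 1.0.
    """
    ratio = ear_to_ear_distance(pred_vertices) / ear_to_ear_distance(gt_vertices)
    return abs(1.0 - ratio)
```

Under this reading, the 15% - 20% gain reported in the abstract would correspond to the predicted width ratio moving that much closer to 1.0 than the baseline's.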
Related papers
- NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior [5.819784482811377]
We propose a novel method, NeRFFaceSpeech, which can produce high-quality 3D-aware talking heads.
Our method can craft a 3D-consistent facial feature space corresponding to a single image.
We also introduce LipaintNet, which can fill in the missing information in the inner-mouth area.
arXiv Detail & Related papers (2024-05-09T13:14:06Z)
- Learn2Talk: 3D Talking Face Learns from 2D Talking Face [15.99315075587735]
We propose a learning framework named Learn2Talk, which can construct a better 3D talking face network.
Inspired by the audio-video sync network, a 3D sync-lip expert model is devised in pursuit of accurate lip-sync.
A teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network.
arXiv Detail & Related papers (2024-04-19T13:45:14Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
At the same time, it achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- Rethinking Voice-Face Correlation: A Geometry View [34.94679112707095]
We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction.
We find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium.
arXiv Detail & Related papers (2023-07-26T04:03:10Z)
- Generating 2D and 3D Master Faces for Dictionary Attacks with a Network-Assisted Latent Space Evolution [68.8204255655161]
A master face is a face image that passes face-based identity authentication for a high percentage of the population.
We optimize these faces for 2D and 3D face verification models.
In 3D, we generate faces using the 2D StyleGAN2 generator and predict a 3D structure using a deep 3D face reconstruction network.
arXiv Detail & Related papers (2022-11-25T09:15:38Z)
- Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis [90.43371339871105]
We propose Dynamic Facial Radiance Fields (DFRF) for few-shot talking head synthesis.
DFRF conditions the face radiance field on 2D appearance images to learn a face prior.
Experiments show DFRF can synthesize natural and high-quality audio-driven talking head videos for novel identities with only 40k iterations.
arXiv Detail & Related papers (2022-07-24T16:46:03Z)
- Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices? [16.716830359688853]
This work digs into a root question in human perception: can face geometry be gleaned from one's voice?
We propose our analysis framework, Cross-Modal Perceptionist, under both supervised and unsupervised learning.
arXiv Detail & Related papers (2022-03-18T10:03:07Z)
- 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head [13.305263646852087]
We introduce 3D-TalkEmo, a deep neural network that generates 3D talking head animation with various emotions.
We also create a large 3D dataset with synchronized audio and video, a rich corpus, and various emotional states of different persons.
arXiv Detail & Related papers (2021-04-25T02:48:19Z)
- Deep 3D Portrait from a Single Image [54.634207317528364]
We present a learning-based approach for recovering the 3D geometry of a human head from a single portrait image.
A two-step geometry learning scheme is proposed to learn 3D head reconstruction from in-the-wild face images.
We evaluate the accuracy of our method both in 3D and with pose manipulation tasks on 2D images.
arXiv Detail & Related papers (2020-04-24T08:55:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.