Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
- URL: http://arxiv.org/abs/2203.09824v1
- Date: Fri, 18 Mar 2022 10:03:07 GMT
- Title: Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
- Authors: Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann
- Abstract summary: This work digs into a root question in human perception: can face geometry be gleaned from one's voice?
We propose our analysis framework, Cross-Modal Perceptionist, under both supervised and unsupervised learning.
- Score: 16.716830359688853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work digs into a root question in human perception: can face
geometry be gleaned from one's voice? Previous works that study this question
only adopt developments in image synthesis and convert voices into face images
to show correlations, but working in the image domain unavoidably involves
predicting attributes that voices cannot hint at, including facial textures,
hairstyles, and backgrounds. We instead investigate the ability to reconstruct
3D faces to concentrate on geometry alone, which is much more physiologically
grounded. We
propose our analysis framework, Cross-Modal Perceptionist, under both
supervised and unsupervised learning. First, we construct a dataset,
Voxceleb-3D, which extends Voxceleb and includes paired voices and face meshes,
making supervised learning possible. Second, we use a knowledge distillation
mechanism to study whether face geometry can still be gleaned from voices
without paired voices and 3D face data under limited availability of 3D face
scans. We break down the core question into four parts and perform visual and
numerical analyses as responses to the core question. Our findings echo those
in physiology and neuroscience about the correlation between voices and facial
structures. The work provides future human-centric cross-modal learning with
explainable foundations. See our project page:
https://choyingw.github.io/works/Voice2Mesh/index.html
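The knowledge-distillation mechanism described in the abstract can be sketched at a high level: an image-based "teacher" supplies pseudo ground-truth 3D face meshes, and a voice-based "student" is trained to match them, removing the need for paired voice/3D-scan data. The following is a minimal illustrative sketch; all names, shapes, and the L1 loss choice are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

# Illustrative mesh size only; real face meshes have thousands of vertices.
NUM_VERTICES = 5

def distillation_loss(student_mesh: np.ndarray, teacher_mesh: np.ndarray) -> float:
    """Mean L1 distance between the student's voice-predicted vertices and
    the teacher's image-predicted vertices (the distillation target)."""
    return float(np.abs(student_mesh - teacher_mesh).mean())

rng = np.random.default_rng(0)
# Stand-in for the image-based teacher's mesh prediction (N x 3 vertex coords).
teacher = rng.normal(size=(NUM_VERTICES, 3))
# Stand-in for an imperfect student prediction made from a voice embedding.
student = teacher + 0.1 * rng.normal(size=(NUM_VERTICES, 3))
loss = distillation_loss(student, teacher)
```

In training, this loss would be minimized over the student's parameters, so the voice encoder learns geometry supervision indirectly from the image-based teacher.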
Related papers
- ChatAnything: Facetime Chat with LLM-Enhanced Personas [87.76804680223003]
We propose the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
For MoV, we utilize the text-to-speech (TTS) algorithms with a variety of pre-defined tones.
For MoD, we combine the recent popular text-to-image generation techniques and talking head algorithms to streamline the process of generating talking objects.
arXiv Detail & Related papers (2023-11-12T08:29:41Z)
- The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features [27.89284938655708]
This work unveils the enigmatic link between phonemes and facial features.
From a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face.
Our results indicate that AMs are more predictable from vowels compared to consonants, particularly with plosives.
arXiv Detail & Related papers (2023-07-26T04:08:12Z)
- Rethinking Voice-Face Correlation: A Geometry View [34.94679112707095]
We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction.
We find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium.
arXiv Detail & Related papers (2023-07-26T04:03:10Z)
- EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices [76.50816193153098]
EgoBody is a novel large-scale dataset for social interactions in complex 3D scenes.
We employ Microsoft HoloLens2 headsets to record rich egocentric data streams including RGB, depth, eye gaze, head and hand tracking.
To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames.
arXiv Detail & Related papers (2021-12-14T18:41:28Z)
- Controlled AutoEncoders to Generate Faces from Voices [30.062970046955577]
We propose a framework to morph a target face in response to a given voice in a way that facial features are implicitly guided by learned voice-face correlation.
We evaluate the framework on VoxCeleb and VGGFace datasets through human subjects and face retrieval.
arXiv Detail & Related papers (2021-07-16T16:04:29Z)
- 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head [13.305263646852087]
We introduce 3D-TalkEmo, a deep neural network that generates 3D talking head animation with various emotions.
We also create a large 3D dataset with synchronized audios and videos, rich corpus, as well as various emotion states of different persons.
arXiv Detail & Related papers (2021-04-25T02:48:19Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices [18.600534152951926]
This work analyzes whether 3D face models can be learned from only the speech inputs of speakers.
We propose both supervised and unsupervised learning frameworks. In particular, we demonstrate how unsupervised learning is possible in the absence of a direct voice-to-3D-face dataset.
arXiv Detail & Related papers (2021-04-21T01:14:50Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
- Deep 3D Portrait from a Single Image [54.634207317528364]
We present a learning-based approach for recovering the 3D geometry of a human head from a single portrait image.
A two-step geometry learning scheme is proposed to learn 3D head reconstruction from in-the-wild face images.
We evaluate the accuracy of our method both in 3D and with pose manipulation tasks on 2D images.
arXiv Detail & Related papers (2020-04-24T08:55:37Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.