The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link
between Phonemes and Facial Features
- URL: http://arxiv.org/abs/2307.13953v1
- Date: Wed, 26 Jul 2023 04:08:12 GMT
- Title: The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link
between Phonemes and Facial Features
- Authors: Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj
- Abstract summary: This work unveils the enigmatic link between phonemes and facial features.
From a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face.
Our results indicate that AMs are more predictable from vowels than from consonants, particularly plosives.
- Score: 27.89284938655708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work unveils the enigmatic link between phonemes and facial features.
Traditional studies on voice-face correlations typically involve using a long
period of voice input, including generating face images from voices and
reconstructing 3D face meshes from voices. However, in situations like
voice-based crimes, the available voice evidence may be short and limited.
Additionally, from a physiological perspective, each segment of speech --
phoneme -- corresponds to different types of airflow and movements in the face.
Therefore, it is advantageous to discover the hidden link between phonemes and
face attributes. In this paper, we propose an analysis pipeline to help us
explore the voice-face relationship in a fine-grained manner, i.e., phonemes
vs. facial anthropometric measurements (AMs). We build an estimator for each
phoneme-AM pair and evaluate the correlation through hypothesis testing. Our
results indicate that AMs are more predictable from vowels than from
consonants, particularly plosives. Additionally, we observe that if a
specific AM exhibits more movement during phoneme pronunciation, it is more
predictable. Our findings are consistent with physiological accounts of voice-face
correlation and lay the groundwork for future research on speech-face multimodal learning.
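The abstract describes the pipeline only at a high level: one estimator per phoneme-AM pair, with predictability assessed by hypothesis testing. The sketch below is a minimal illustration of that idea, not the authors' implementation; the per-speaker acoustic features, the ridge-regression estimator, and the permutation test are assumptions made for the example.

```python
# Hedged sketch of a phoneme-vs-AM analysis: one estimator per (phoneme, AM)
# pair, evaluated with a hypothesis test. Feature extraction, ridge regression,
# and the permutation test are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def phoneme_am_correlation(features, am_values, n_permutations=1000, seed=0):
    """Test whether one AM is predictable from one phoneme's acoustic features.

    features:  (n_speakers, n_dims) acoustic features for a single phoneme
    am_values: (n_speakers,) one facial anthropometric measurement per speaker
    Returns the cross-validated score and a permutation-test p-value.
    """
    rng = np.random.default_rng(seed)
    estimator = Ridge(alpha=1.0)  # assumed estimator; the abstract leaves this open

    # Observed predictability: mean cross-validated R^2 of the estimator.
    observed = cross_val_score(estimator, features, am_values, cv=5).mean()

    # Null distribution: shuffle AM labels so any voice-face link is destroyed.
    null_scores = []
    for _ in range(n_permutations):
        shuffled = rng.permutation(am_values)
        null_scores.append(cross_val_score(estimator, features, shuffled, cv=5).mean())

    p_value = (np.sum(np.array(null_scores) >= observed) + 1) / (n_permutations + 1)
    return observed, p_value


# Usage: loop over every (phoneme, AM) pair and keep the pairs whose p-value
# survives a multiple-comparison correction.
```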
Related papers
- Rethinking Voice-Face Correlation: A Geometry View [34.94679112707095]
We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction.
We find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium.
arXiv Detail & Related papers (2023-07-26T04:03:10Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Affective social anthropomorphic intelligent system [1.7849339006560665]
This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality.
A voice style transfer method is also proposed to map the attributes of a specific emotion.
arXiv Detail & Related papers (2023-04-19T18:24:57Z) - Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z) - Perception Point: Identifying Critical Learning Periods in Speech for
Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between theories in cognitive psychology and our modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z) - Controlled AutoEncoders to Generate Faces from Voices [30.062970046955577]
We propose a framework to morph a target face in response to a given voice in a way that facial features are implicitly guided by learned voice-face correlation.
We evaluate the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval.
arXiv Detail & Related papers (2021-07-16T16:04:29Z) - Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in
Real-Time MRI [9.614694312155798]
We propose a novel deep neural network-based learning framework that extracts acoustic information from the variable-length sequence of vocal tract shapes during speech production.
The proposed framework comprises convolutions, a recurrent network, and a connectionist temporal classification (CTC) loss, trained entirely end-to-end (a minimal sketch of this structure appears after this list).
To the best of our knowledge, this is the first study to demonstrate recognition of entire spoken sentences from an individual's articulatory motions captured by rtMRI video.
arXiv Detail & Related papers (2021-06-16T11:20:02Z) - Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction methods.
arXiv Detail & Related papers (2021-03-29T09:09:39Z) - Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to the non-parallel data problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and the human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
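As noted in the rtMRI entry above, that framework is summarized only as convolutions, a recurrent network, and a CTC loss trained end-to-end. The PyTorch sketch below shows one plausible shape of such a model; the layer sizes, the GRU, and the frame dimensions are illustrative assumptions rather than the paper's actual architecture.

```python
# Minimal PyTorch sketch of a convolution + recurrent network + CTC recipe
# for recognizing speech from rtMRI vocal-tract video. Only the overall
# CNN -> RNN -> CTC structure comes from the summary; everything else is assumed.
import torch
import torch.nn as nn


class VocalTractRecognizer(nn.Module):
    def __init__(self, n_tokens, hidden=256):
        super().__init__()
        # Convolutional frontend over single-channel rtMRI frames.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Recurrent network over the variable-length frame sequence.
        self.rnn = nn.GRU(64 * 4 * 4, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tokens)  # token logits incl. CTC blank

    def forward(self, frames):
        # frames: (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        x, _ = self.rnn(x)
        return self.out(x).log_softmax(-1)


# Trained end-to-end with CTC loss, e.g.:
#   loss = nn.CTCLoss(blank=0)(logits.transpose(0, 1), targets,
#                              input_lengths, target_lengths)
```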
This list is automatically generated from the titles and abstracts of the papers on this site.