From Inference to Generation: End-to-end Fully Self-supervised
Generation of Human Face from Speech
- URL: http://arxiv.org/abs/2004.05830v1
- Date: Mon, 13 Apr 2020 09:01:49 GMT
- Title: From Inference to Generation: End-to-end Fully Self-supervised
Generation of Human Face from Speech
- Authors: Hyeong-Seok Choi, Changdae Park, Kyogu Lee
- Abstract summary: We propose a multi-modal learning framework that links the inference stage and the generation stage.
The proposed method exploits recent developments in GAN techniques and generates the human face directly from the speech waveform.
Experimental results show that the proposed network can not only match the relationship between the human face and speech, but can also generate high-quality human face samples conditioned on speech.
- Score: 20.41722156886205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work seeks the possibility of generating the human face from
voice solely based on audio-visual data without any human-labeled annotations.
To this end, we propose a multi-modal learning framework that links the
inference stage and the generation stage. First, the inference networks are
trained to match the speaker identity between the two different modalities.
Then the trained inference networks cooperate with the generation network by
providing conditional information about the voice. The proposed method
exploits recent developments in GAN techniques and generates the human face
directly from the speech waveform, making our system fully end-to-end. We
analyze the extent to which the network can naturally disentangle two latent
factors that contribute to the generation of a face image - one that comes
directly from the speech signal and one that is unrelated to it - and explore
whether the network can learn to generate the natural human face image
distribution by modeling these factors. Experimental results show that the
proposed network can not only match the relationship between the human face
and speech, but can also generate high-quality human face samples conditioned
on speech. Finally, the correlation between the generated face and the
corresponding speech is quantitatively measured to analyze the relationship
between the two modalities.
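
As a rough illustration of the two-stage pipeline described in the abstract
(inference networks trained to match speaker identity across modalities,
followed by a generator conditioned on the speech embedding plus a
speech-unrelated noise vector), the following is a minimal PyTorch-style
sketch. The module names, layer sizes, toy resolutions, and the contrastive
matching loss are illustrative assumptions rather than the authors'
implementation; the discriminator and adversarial losses used in the actual
GAN training are omitted.

```python
# Minimal sketch of the two-stage framework, assuming PyTorch.
# All module names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechEncoder(nn.Module):
    """Maps a raw speech waveform to a speaker-identity embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, wav):                # wav: (B, 1, T)
        h = self.net(wav).squeeze(-1)      # (B, 128)
        return F.normalize(self.proj(h), dim=-1)


class FaceEncoder(nn.Module):
    """Maps a face image to an embedding in the same space."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, img):                # img: (B, 3, H, W)
        return F.normalize(self.proj(self.net(img)), dim=-1)


def cross_modal_matching_loss(speech_emb, face_emb, temperature=0.07):
    """Stage 1: pull same-speaker embeddings together across modalities
    (a contrastive stand-in for the paper's identity-matching objective)."""
    logits = speech_emb @ face_emb.t() / temperature   # (B, B)
    targets = torch.arange(speech_emb.size(0))
    return F.cross_entropy(logits, targets)


class Generator(nn.Module):
    """Stage 2: synthesize a face from the speech embedding (speech-derived
    factor) plus a noise vector modeling speech-unrelated variation."""
    def __init__(self, emb_dim=128, noise_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + noise_dim, 128 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (128, 4, 4)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, speech_emb, noise):
        return self.net(torch.cat([speech_emb, noise], dim=1))


if __name__ == "__main__":
    wav = torch.randn(4, 1, 16000)         # 1 s of 16 kHz audio
    img = torch.randn(4, 3, 16, 16)        # toy-resolution faces
    s_enc, f_enc, gen = SpeechEncoder(), FaceEncoder(), Generator()

    # Stage 1: self-supervised cross-modal identity matching.
    loss_match = cross_modal_matching_loss(s_enc(wav), f_enc(img))

    # Stage 2: conditional generation from the (frozen) speech embedding.
    fake = gen(s_enc(wav).detach(), torch.randn(4, 64))
    print(loss_match.item(), fake.shape)   # fake: (4, 3, 16, 16)
```

In the full system the generator is trained adversarially against a
discriminator and conditioned on the pretrained inference network's speech
embedding, so this sketch only outlines how the two stages connect.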
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk combines an audio-to-expression transformer with a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - Realistic Speech-to-Face Generation with Speech-Conditioned Latent
Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z) - Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents [5.244401764969407]
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances.
arXiv Detail & Related papers (2023-09-17T18:46:25Z) - Articulation GAN: Unsupervised modeling of articulatory learning [6.118463549086599]
We introduce the Articulatory Generator to the Generative Adversarial Network paradigm.
A separate pre-trained physical model transforms the generated EMA representations to speech waveforms.
Articulatory analysis of the generated EMA representations suggests that the network learns to control articulators in a manner that closely follows that of human articulators during speech production.
arXiv Detail & Related papers (2022-10-27T05:07:04Z) - DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video
Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z) - DFA-NeRF: Personalized Talking Head Generation via Disentangled Face
Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z) - Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent
Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z) - Controlled AutoEncoders to Generate Faces from Voices [30.062970046955577]
We propose a framework to morph a target face in response to a given voice in a way that facial features are implicitly guided by learned voice-face correlation.
We evaluate the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval.
arXiv Detail & Related papers (2021-07-16T16:04:29Z) - Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)