Leveraging WaveNet for Dynamic Listening Head Modeling from Speech
- URL: http://arxiv.org/abs/2409.05089v1
- Date: Sun, 8 Sep 2024 13:19:22 GMT
- Title: Leveraging WaveNet for Dynamic Listening Head Modeling from Speech
- Authors: Minh-Duc Nguyen, Hyung-Jeong Yang, Seung-Won Kim, Ji-Eun Shin, Soo-Hyung Kim
- Abstract summary: The creation of listener facial responses aims to simulate interactive communication feedback from a listener during a face-to-face conversation.
Our approach focuses on capturing the subtle nuances of listener feedback, ensuring the preservation of individual listener identity.
- Score: 11.016004057765185
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The creation of listener facial responses aims to simulate interactive communication feedback from a listener during a face-to-face conversation. Our goal is to generate believable videos of listeners' heads that respond authentically to a single speaker, using a sequence-to-sequence model that combines WaveNet with a Long Short-Term Memory (LSTM) network. Our approach focuses on capturing the subtle nuances of listener feedback, ensuring the preservation of individual listener identity while expressing appropriate attitudes and viewpoints. Experimental results show that our method surpasses the baseline models on the ViCo benchmark dataset.
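As a rough, self-contained sketch of the architecture the abstract outlines (a WaveNet-style stack of dilated causal convolutions feeding an LSTM decoder), consider the PyTorch snippet below. It is illustrative only: the module sizes, the 80-dimensional mel-feature input, and the 70-dimensional head-motion output (e.g., 3DMM-style coefficients) are assumptions, not the authors' released code.

```python
# Illustrative sketch only: sizes, the mel input, and the motion
# parameterization are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class DilatedCausalBlock(nn.Module):
    """One WaveNet-style block: gated dilated causal conv + residual."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.pad = dilation  # left-pad (kernel_size - 1) * dilation
        self.filter = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                       # x: (B, C, T)
        h = nn.functional.pad(x, (self.pad, 0))  # causal: pad past only
        h = torch.tanh(self.filter(h)) * torch.sigmoid(self.gate(h))
        return x + self.res(h)                   # residual connection

class SpeechToListenerHead(nn.Module):
    """Encode speaker audio with stacked dilated causal convolutions
    (WaveNet-style), then decode listener head motion with an LSTM."""
    def __init__(self, audio_dim=80, channels=64, motion_dim=70):
        super().__init__()
        self.inp = nn.Conv1d(audio_dim, channels, 1)
        self.blocks = nn.ModuleList(
            [DilatedCausalBlock(channels, 2 ** i) for i in range(6)]
        )
        self.decoder = nn.LSTM(channels, 128, num_layers=2, batch_first=True)
        self.head = nn.Linear(128, motion_dim)  # e.g. 3DMM coefficients

    def forward(self, audio):                    # audio: (B, T, audio_dim)
        h = self.inp(audio.transpose(1, 2))
        for blk in self.blocks:
            h = blk(h)
        out, _ = self.decoder(h.transpose(1, 2))
        return self.head(out)                    # (B, T, motion_dim)

# Example: 2 clips, 100 frames of 80-dim mel features -> head motion.
motion = SpeechToListenerHead()(torch.randn(2, 100, 80))
print(motion.shape)  # torch.Size([2, 100, 70])
```

The dilated stack gives the encoder a receptive field that grows exponentially with depth, the property WaveNet relies on to cover long audio contexts cheaply; the LSTM then integrates that context into a smooth motion trajectory.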
Related papers
- CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation [9.741109135330262]
Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation.
We propose a user-friendly framework called CustomListener to realize free-form, text-prior-guided listener generation.
arXiv Detail & Related papers (2024-03-01T04:31:56Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
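As a toy illustration of the motion-codeword idea (not ELP's actual architecture), the sketch below composes each frame's facial motion from a learned codebook and samples the codeword distribution to obtain diverse responses. The codebook size, feature dimension, and motion parameterization are all assumptions.

```python
# Toy sketch of composing facial motion from discrete codewords;
# all sizes are assumptions, not ELP's architecture.
import torch
import torch.nn as nn

class CodewordMotionHead(nn.Module):
    def __init__(self, feat_dim=128, num_codes=256, motion_dim=70):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, motion_dim))
        self.to_logits = nn.Linear(feat_dim, num_codes)

    def forward(self, feats, sample=True, temperature=1.0):
        logits = self.to_logits(feats) / temperature   # (B, T, num_codes)
        if sample:  # stochastic: diverse but plausible responses
            idx = torch.distributions.Categorical(logits=logits).sample()
            return self.codebook[idx]                  # (B, T, motion_dim)
        # deterministic: soft composition of codewords
        return torch.softmax(logits, dim=-1) @ self.codebook

head = CodewordMotionHead()
feats = torch.randn(2, 50, 128)  # speaker-conditioned features (assumed)
print(head(feats).shape)         # torch.Size([2, 50, 70])
```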
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model [14.220727407255966]
Responsive listening head generation is an important task that aims to model face-to-face communication scenarios.
We propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net).
arXiv Detail & Related papers (2023-08-31T11:10:28Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a listener's response as a sequence of facial gestures quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
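The sketch below illustrates that two-stage recipe: continuous listener gestures are mapped to discrete indices with a VQ-VAE-style codebook, and an autoregressive model predicts the next gesture token. The small LSTM stands in for the paper's language-model-based predictor, and the codebook size and gesture dimension are assumptions.

```python
# Sketch of quantize-then-predict; the LSTM stands in for the paper's
# language-model predictor, and all sizes are assumptions.
import torch
import torch.nn as nn

def vq_quantize(z, codebook):
    """Map each continuous gesture vector to its nearest codebook index."""
    d = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    return d.argmin(dim=-1)                    # (B, T) discrete indices

class GestureTokenLM(nn.Module):
    """Autoregressive next-token model over quantized gesture indices."""
    def __init__(self, num_codes=512, emb=128):
        super().__init__()
        self.embed = nn.Embedding(num_codes, emb)
        self.lstm = nn.LSTM(emb, emb, batch_first=True)
        self.out = nn.Linear(emb, num_codes)

    def forward(self, tokens):                 # tokens: (B, T) int64
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                     # next-token logits

codebook = torch.randn(512, 64)           # learned by the VQ-VAE in practice
gestures = torch.randn(2, 30, 64)         # continuous listener motion
tokens = vq_quantize(gestures, codebook)  # (2, 30)
logits = GestureTokenLM()(tokens)
next_tok = logits[:, -1].argmax(dim=-1)   # greedy next-gesture prediction
```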
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
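A minimal sketch of such a fusion step is given below: per-step phoneme embeddings are concatenated with listener visual features to produce conditioning for the synthesizer. The concatenate-and-project design and all dimensions are assumptions, not the paper's baseline.

```python
# Assumed concatenate-and-project fusion; not the VA-TTS baseline itself.
import torch
import torch.nn as nn

class PhonemeVisualFusion(nn.Module):
    def __init__(self, num_phonemes=80, ph_dim=128, vis_dim=256, out_dim=256):
        super().__init__()
        self.ph_embed = nn.Embedding(num_phonemes, ph_dim)
        self.proj = nn.Linear(ph_dim + vis_dim, out_dim)

    def forward(self, phonemes, visual):  # (B, T) ids, (B, T, vis_dim)
        fused = torch.cat([self.ph_embed(phonemes), visual], dim=-1)
        return torch.relu(self.proj(fused))  # conditioning for synthesis

fusion = PhonemeVisualFusion()
cond = fusion(torch.randint(0, 80, (2, 40)), torch.randn(2, 40, 256))
print(cond.shape)  # torch.Size([2, 40, 256])
```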
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Modeling Speaker-Listener Interaction for Backchannel Prediction [24.52345279975304]
Backchanneling theories emphasize the active and continuous role of the listener in the course of a conversation.
We propose a neural acoustic backchannel classifier for minimal responses that processes acoustic features from the speaker's speech.
Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions.
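A hedged sketch of what such a classifier could look like follows: a GRU reads the speaker's acoustic features, a pooled behavior embedding is appended, and a sigmoid outputs the backchannel probability. The GRU, the single behavior vector, and all sizes are assumptions.

```python
# Assumed design: GRU over acoustic frames plus one behavior embedding;
# not the paper's exact classifier.
import torch
import torch.nn as nn

class BackchannelClassifier(nn.Module):
    def __init__(self, acoustic_dim=40, hidden=64, behavior_dim=16):
        super().__init__()
        self.gru = nn.GRU(acoustic_dim, hidden, batch_first=True)
        self.clf = nn.Linear(hidden + behavior_dim, 1)

    def forward(self, acoustic, behavior):
        _, h = self.gru(acoustic)            # h: (1, B, hidden)
        logit = self.clf(torch.cat([h[-1], behavior], dim=-1))
        return torch.sigmoid(logit)          # P(backchannel)

model = BackchannelClassifier()
p = model(torch.randn(2, 100, 40), torch.randn(2, 16))
print(p.shape)  # torch.Size([2, 1])
```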
arXiv Detail & Related papers (2023-04-10T09:22:06Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
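For context, the sketch below shows the generic NeRF building block such a framework rests on: an MLP maps positionally encoded 3D coordinates to color and density, which a renderer integrates along camera rays. It follows the common NeRF recipe and is not DialogueNeRF itself.

```python
# Generic NeRF block following the common recipe; not DialogueNeRF.
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    """gamma(x) = (x, sin(2^k pi x), cos(2^k pi x)) for k = 0..n_freqs-1."""
    out = [x]
    for k in range(n_freqs):
        out += [torch.sin(2.0 ** k * torch.pi * x),
                torch.cos(2.0 ** k * torch.pi * x)]
    return torch.cat(out, dim=-1)

class TinyNeRF(nn.Module):
    def __init__(self, n_freqs=10, hidden=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * n_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (r, g, b, sigma)
        )

    def forward(self, xyz):                  # xyz: (N, 3) sample points
        rgb_sigma = self.mlp(positional_encoding(xyz))
        rgb = torch.sigmoid(rgb_sigma[:, :3])
        sigma = torch.relu(rgb_sigma[:, 3:])  # non-negative density
        return rgb, sigma

rgb, sigma = TinyNeRF()(torch.rand(1024, 3))
print(rgb.shape, sigma.shape)  # (1024, 3) and (1024, 1)
```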
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities to this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.