Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
- URL: http://arxiv.org/abs/2204.08451v1
- Date: Mon, 18 Apr 2022 17:58:04 GMT
- Title: Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
- Authors: Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo
Kanazawa, Shiry Ginosar
- Abstract summary: We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
- Score: 89.01668641930206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a framework for modeling interactional communication in dyadic
conversations: given multimodal inputs of a speaker, we autoregressively output
multiple possibilities of corresponding listener motion. We combine the motion
and speech audio of the speaker using a motion-audio cross attention
transformer. Furthermore, we enable non-deterministic prediction by learning a
discrete latent representation of realistic listener motion with a novel
motion-encoding VQ-VAE. Our method organically captures the multimodal and
non-deterministic nature of nonverbal dyadic interactions. Moreover, it
produces realistic 3D listener facial motion synchronous with the speaker (see
video). We demonstrate that our method outperforms baselines qualitatively and
quantitatively via a rich suite of experiments. To facilitate this line of
research, we introduce a novel and large in-the-wild dataset of dyadic
conversations. Code, data, and videos available at
https://evonneng.github.io/learning2listen/.
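As a reading aid, the pipeline the abstract describes (cross-attention fusion of speaker motion and audio, a VQ-VAE codebook over listener motion, and autoregressive sampling of multiple listener continuations) might be sketched in PyTorch as below. The module names, feature dimensions, and the `predictor` interface are illustrative assumptions, not the authors' released implementation; see the project page above for the official code.
```python
# Minimal sketch (assumed shapes and names, not the official implementation):
# speaker motion and audio are fused with cross-attention, and listener motion
# is predicted autoregressively as discrete VQ-VAE codebook indices.
import torch
import torch.nn as nn

class MotionAudioCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_feats, audio_feats):
        # Speaker motion features attend over speaker audio features.
        fused, _ = self.attn(motion_feats, audio_feats, audio_feats)
        return self.norm(motion_feats + fused)

class ListenerVQ(nn.Module):
    """Toy vector-quantization layer over listener-motion features."""
    def __init__(self, num_codes=256, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def quantize(self, z):
        # Nearest codebook entry per timestep (straight-through estimator omitted).
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, num_codes)
        idx = dists.argmin(-1)                                           # discrete listener codes
        return self.codebook(idx), idx

def sample_listener_codes(predictor, fused_speaker, steps=32, temperature=1.0):
    # `predictor` is a hypothetical autoregressive head: (speaker context, past codes) -> logits.
    codes = torch.zeros(fused_speaker.size(0), 1, dtype=torch.long, device=fused_speaker.device)
    for _ in range(steps):
        logits = predictor(fused_speaker, codes)             # (B, num_codes)
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)                    # sampling -> non-deterministic outputs
        codes = torch.cat([codes, nxt], dim=1)
    return codes[:, 1:]                                      # drop the start token
```
Sampling code indices instead of taking the argmax is what lets the same speaker input yield multiple plausible listener-motion continuations.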
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
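One plausible reading of the per-modality guidance mentioned above is a multi-condition classifier-free guidance rule; the sketch below assumes a hypothetical `model` denoiser interface and is not ConvoFusion's released code.
```python
# Sketch of per-modality classifier-free guidance (assumed interface, for illustration only):
# the denoiser is evaluated with and without each conditioning modality, and user-chosen
# scales modulate how strongly audio vs. text steer the sampled gesture sequence.
import torch

def guided_noise_pred(model, x_t, t, audio_cond, text_cond, s_audio=2.0, s_text=1.0):
    eps_uncond = model(x_t, t, audio=None, text=None)
    eps_audio  = model(x_t, t, audio=audio_cond, text=None)
    eps_text   = model(x_t, t, audio=None, text=text_cond)
    # Each scale controls the influence of one conditioning modality on the prediction.
    return (eps_uncond
            + s_audio * (eps_audio - eps_uncond)
            + s_text  * (eps_text  - eps_uncond))
```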
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating the non-verbal behaviors of a listener in response to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a listener's response as a sequence of facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Affective Faces for Goal-Driven Dyadic Communication [16.72177738101024]
We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation.
Our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context.
arXiv Detail & Related papers (2023-01-26T05:00:09Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be exploited as a free supervisory signal to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
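A minimal sketch of the co-attention idea, assuming a symmetric two-stream multi-head attention layout and 256-dimensional features (not the paper's exact architecture):
```python
# Co-attention sketch (assumed dims; for illustration only): audio and visual streams
# attend to each other symmetrically, yielding cross-modally contextualized features.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # Audio queries attend to visual keys/values, and vice versa.
        audio_ctx, _ = self.v2a(audio, visual, visual)
        visual_ctx, _ = self.a2v(visual, audio, audio)
        return audio + audio_ctx, visual + visual_ctx

# Example with dummy clip features: sequence lengths may differ per modality.
audio = torch.randn(2, 16, 256)
visual = torch.randn(2, 8, 256)
a, v = CoAttention()(audio, visual)   # shapes preserved: (2, 16, 256), (2, 8, 256)
```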
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings [11.741529272872219]
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors.
Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior.
We introduce a probabilistic method to synthesize interlocutor-aware facial gestures in dyadic conversations.
arXiv Detail & Related papers (2020-06-11T14:11:51Z)