Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings
- URL: http://arxiv.org/abs/2006.09888v2
- Date: Thu, 22 Oct 2020 21:22:20 GMT
- Title: Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings
- Authors: Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow
- Abstract summary: To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors.
Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior.
We introduce a probabilistic method to synthesize interlocutor-aware facial gestures in dyadic conversations.
- Score: 11.741529272872219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To enable more natural face-to-face interactions, conversational agents need
to adapt their behavior to their interlocutors. One key aspect of this is
generation of appropriate non-verbal behavior for the agent, for example facial
gestures, here defined as facial expressions and head movements. Most existing
gesture-generating systems do not utilize multi-modal cues from the
interlocutor when synthesizing non-verbal behavior. Those that do, typically
use deterministic methods that risk producing repetitive and non-vivid motions.
In this paper, we introduce a probabilistic method to synthesize
interlocutor-aware facial gestures - represented by highly expressive FLAME
parameters - in dyadic conversations. Our contributions are: a) a method for
feature extraction from multi-party video and speech recordings, resulting in a
representation that allows for independent control and manipulation of
expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a
recent motion-synthesis method based on normalizing flows, to also take
multi-modal signals from the interlocutor as input and subsequently output
interlocutor-aware facial gestures; and c) a subjective evaluation assessing
the use and relative importance of the input modalities. The results show that
the model successfully leverages the input from the interlocutor to generate
more appropriate behavior. Videos, data, and code available at:
https://jonepatr.github.io/lets_face_it.
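Contribution (b) extends MoGlow, a normalizing-flow motion model, so that the flow is conditioned on multi-modal signals from the interlocutor as well as the agent's own speech. The sketch below illustrates the general shape of such a conditional flow with a single affine-coupling layer in PyTorch; the dimensionalities, feature names, and single-layer setup are illustrative assumptions only and are not the authors' released implementation (linked above), which is a full sequence model with recurrent conditioning.
```python
# Minimal sketch (not the authors' code): one conditional affine-coupling step,
# Glow/MoGlow-style, used to sample a single frame of facial-gesture parameters.
# All dimensions and feature choices below are illustrative assumptions.
import torch
import torch.nn as nn

FLAME_DIM = 56        # assumed: e.g. expression coefficients + jaw + head rotation
AGENT_AUDIO_DIM = 40  # assumed: agent's own speech features
INTERLOC_DIM = 96     # assumed: interlocutor audio + facial-gesture features

class ConditionalCoupling(nn.Module):
    """Affine coupling layer whose scale and shift depend on the conditioning signal."""
    def __init__(self, x_dim, cond_dim, hidden=256):
        super().__init__()
        self.half = x_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (x_dim - self.half)),
        )

    def forward(self, x, cond):
        # Split the motion vector; transform one half conditioned on the other
        # half plus the multi-modal context (agent speech + interlocutor cues).
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([xa, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                 # keep scales well-behaved
        yb = xb * torch.exp(log_s) + t
        return torch.cat([xa, yb], dim=-1), log_s.sum(dim=-1)  # z, log|det J|

    def inverse(self, z, cond):
        za, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([za, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=-1)

# Sampling: draw a latent frame and invert the flow given both parties' features.
coupling = ConditionalCoupling(FLAME_DIM, AGENT_AUDIO_DIM + INTERLOC_DIM)
cond = torch.cat([torch.randn(1, AGENT_AUDIO_DIM),    # agent's speech features
                  torch.randn(1, INTERLOC_DIM)], -1)  # interlocutor's multi-modal cues
z = torch.randn(1, FLAME_DIM)                         # sample from the base distribution
flame_frame = coupling.inverse(z, cond)               # FLAME expression + head pose
```
Because sampling draws a fresh latent vector each time, the same conditioning input can yield different but plausible gestures, which is what distinguishes this probabilistic setup from the deterministic systems criticized in the abstract.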
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (an illustrative per-modality guidance sketch appears after this list).
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Passing a Non-verbal Turing Test: Evaluating Gesture Animations Generated from Speech [6.445605125467574]
In this paper, we propose a novel, data-driven technique for generating gestures directly from speech.
Our approach is based on the application of Generative Adversarial Networks (GANs) to model the correlation, rather than causation, between speech and gestures.
For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures.
arXiv Detail & Related papers (2021-07-01T19:38:43Z)
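As referenced in the ConvoFusion entry above, several of these related methods let users weight how strongly each conditioning modality steers generation. The sketch below shows one generic way to do this with per-modality classifier-free-guidance-style weighting for a diffusion denoiser; it is an assumed illustration, not ConvoFusion's actual guidance objectives, and `denoise`, the modality names, and the weights are hypothetical placeholders.
```python
# Generic sketch (assumed): combine unconditional and per-modality conditional
# noise predictions so each modality's influence can be scaled independently.
import torch

def guided_noise_estimate(denoise, x_t, t, audio, text, w_audio=2.0, w_text=1.0):
    """Weight how far the estimate moves toward each modality's condition."""
    eps_uncond = denoise(x_t, t, audio=None, text=None)   # no conditioning
    eps_audio = denoise(x_t, t, audio=audio, text=None)   # audio-only conditioning
    eps_text = denoise(x_t, t, audio=None, text=text)     # text-only conditioning
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))

# Toy usage with a dummy denoiser standing in for a trained gesture model.
dummy = lambda x_t, t, audio=None, text=None: x_t * 0.0
eps = guided_noise_estimate(dummy, torch.zeros(1, 8), t=10,
                            audio=torch.zeros(1, 4), text=torch.zeros(1, 4))
```
Raising one weight while lowering the other biases the sampled gestures toward the corresponding modality, which is the kind of user control the guidance objectives above are describing.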