Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation
- URL: http://arxiv.org/abs/2409.10535v1
- Date: Sat, 31 Aug 2024 08:53:18 GMT
- Title: Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation
- Authors: Esam Ghaleb, Bulat Khaertdinov, Wim Pouw, Marlou Rasenberg, Judith Holler, Aslı Özyürek, Raquel Fernández,
- Abstract summary: In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors.
How can we learn meaningful gesture representations given gestures' variability and their relationship with speech?
This paper employs self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information.
- Score: 4.216085185442862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations given gestures' variability and their relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations by comparing them with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess whether interpretable gesture features can be recovered from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
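To make the multimodal pre-training idea concrete, below is a minimal sketch (not the authors' implementation) of a symmetric InfoNCE objective that grounds gesture embeddings in co-occurring speech; the encoder outputs, embedding size, and temperature value are all assumptions.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(skel_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of co-occurring (skeleton, speech) pairs.
    Matching rows are positives; every other row in the batch is a negative.
    NOTE: illustrative sketch only, not the paper's exact objective."""
    skel = F.normalize(skel_emb, dim=-1)
    speech = F.normalize(speech_emb, dim=-1)
    logits = skel @ speech.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(skel.size(0), device=skel.device)
    # Pull each gesture toward its own speech segment, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with stand-in encoder outputs:
loss = multimodal_info_nce(torch.randn(8, 256), torch.randn(8, 256))
```

Under this kind of objective, an intrinsic evaluation like the one the paper describes could correlate cosine similarities between learned embeddings with human pairwise similarity judgments.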
Related papers
- Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
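As a rough illustration of the direct audio-to-gesture mapping the summary refers to, here is a hypothetical minimal regressor; the GRU backbone, module sizes, and joint count are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AudioToGesture(nn.Module):
    """Toy direct mapping from per-frame audio features to joint poses.
    Illustrative only; layer sizes are assumptions."""
    def __init__(self, audio_dim=64, hidden=128, n_joints=25):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_joints * 3)  # x, y, z per joint

    def forward(self, audio_feats):                  # (batch, frames, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)                          # (batch, frames, n_joints * 3)

poses = AudioToGesture()(torch.randn(2, 100, 64))    # toy batch
```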
arXiv Detail & Related papers (2024-10-17T17:22:59Z)
- Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality [0.0]
This study investigates how different types of gestures influence perceived interaction quality and listener understanding.
Our model combines beat gestures generated by a learned speech-driven module with manually captured iconic gestures.
Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding.
arXiv Detail & Related papers (2024-06-18T12:23:00Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
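A hypothetical sketch of the first-order motion idea: the motion stream is simply the framewise difference of joint positions, zero-padded and concatenated with the raw joints (the actual framework is more involved).

```python
import torch

def add_first_order_motion(joints):
    """Append frame-to-frame joint displacements to a skeleton sequence.
    joints: (frames, n_joints, coords). Illustrative sketch only."""
    motion = joints[1:] - joints[:-1]                 # first-order difference
    motion = torch.cat([torch.zeros_like(joints[:1]), motion], dim=0)
    return torch.cat([joints, motion], dim=-1)        # joint + motion channels

feats = add_first_order_motion(torch.randn(100, 21, 2))  # -> (100, 21, 4)
```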
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Leveraging Speech for Gesture Detection in Multimodal Communication [3.798147784987455]
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system.
Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech.
This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures.
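A minimal, assumed fusion baseline for co-speech gesture detection: concatenate speech and skeleton features per window and predict gesture vs. no gesture (the paper's actual model is more elaborate; all dimensions are assumptions).

```python
import torch
import torch.nn as nn

class CoSpeechGestureDetector(nn.Module):
    """Toy late-fusion detector over per-window speech and skeleton features."""
    def __init__(self, speech_dim=64, skel_dim=128, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(speech_dim + skel_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # gesture logit per window
        )

    def forward(self, speech, skel):                  # (batch, dim) each
        return self.fuse(torch.cat([speech, skel], dim=-1))

logits = CoSpeechGestureDetector()(torch.randn(4, 64), torch.randn(4, 128))
```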
arXiv Detail & Related papers (2024-04-23T11:54:05Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method includes two guidance objectives that let users modulate the impact of different conditioning modalities.
Our method is versatile: it can be trained to generate either monologue gestures or conversational gestures; one plausible reading of the guidance mechanism is sketched below.
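The following is one plausible reading of per-modality guidance, in the style of classifier-free guidance; the conditioning split (own speech vs. interlocutor cues) and the weights are assumptions, not ConvoFusion's actual formulation.

```python
import torch

def multi_condition_guidance(eps_uncond, eps_speech, eps_interloc,
                             w_speech=2.0, w_interloc=1.0):
    """Classifier-free-guidance-style mixing of denoiser outputs, with one
    weight per conditioning modality. Each eps_* is the noise prediction of
    the same diffusion model under a different conditioning; the weights set
    how strongly each modality steers the sample. Hypothetical sketch."""
    return (eps_uncond
            + w_speech * (eps_speech - eps_uncond)
            + w_interloc * (eps_interloc - eps_uncond))

eps = multi_condition_guidance(torch.randn(1, 75), torch.randn(1, 75), torch.randn(1, 75))
```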
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Diversifying Joint Vision-Language Tokenization Learning [51.82353485389527]
Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.
We propose joint vision-language representation learning by diversifying the tokenization learning process.
arXiv Detail & Related papers (2023-06-06T05:41:42Z)
- Deep Learning of Segment-Level Feature Representation for Speech Emotion Recognition in Conversations [9.432208348863336]
We propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-level audio representations from individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores intra- and inter-speaker dependencies.
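A small sketch of such a pipeline stage: an attention-pooled bi-directional GRU over 128-dimensional VGGish-style segment embeddings. Layer sizes and the class count are assumptions.

```python
import torch
import torch.nn as nn

class AttentiveBiGRU(nn.Module):
    """Bi-directional GRU over segment-level audio embeddings with additive
    attention pooling, yielding one utterance vector for emotion classification.
    Illustrative sketch only."""
    def __init__(self, in_dim=128, hidden=64, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.clf = nn.Linear(2 * hidden, n_classes)

    def forward(self, segments):                      # (batch, n_segments, in_dim)
        h, _ = self.gru(segments)                     # (batch, n_segments, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over segments
        pooled = (weights * h).sum(dim=1)
        return self.clf(pooled)

logits = AttentiveBiGRU()(torch.randn(2, 10, 128))    # 10 VGGish-style segments
```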
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods for employing visual information as auxiliary signals in general NLP tasks.
For each sentence, we first retrieve a flexible number of images, either from a light topic-image lookup table built over existing sentence-image pairs or from a cross-modal embedding space pre-trained on off-the-shelf text-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
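A toy version of that dual encoding, under assumed dimensions and layer counts:

```python
import torch
import torch.nn as nn

class TextImageEncoder(nn.Module):
    """Sketch of the described setup: a Transformer encoder for token
    embeddings and a small CNN for retrieved images. Illustrative only."""
    def __init__(self, d_model=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, d_model),                   # project images to text space
        )

    def forward(self, tokens, images):    # (b, seq, d_model), (n_img, 3, H, W)
        return self.text_enc(tokens), self.img_enc(images)

txt, img = TextImageEncoder()(torch.randn(2, 12, 128), torch.randn(5, 3, 64, 64))
```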
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Multimodal analysis of the predictability of hand-gesture properties [10.332200713176768]
Embodied conversational agents benefit from being able to accompany their speech with gestures.
We investigate which gesture properties can be predicted from speech text and/or audio using contemporary deep learning.
arXiv Detail & Related papers (2021-08-12T14:16:00Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing matching from non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
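A minimal probe of the kind described might look like the following; the representation dimensions and the MLP head are assumptions.

```python
import torch
import torch.nn as nn

class MatchProbe(nn.Module):
    """Lightweight probe: given a frozen text representation and a frozen
    visual representation, predict whether they describe the same object.
    Hypothetical sketch."""
    def __init__(self, text_dim=768, vis_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + vis_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # match / non-match logit
        )

    def forward(self, text_rep, vis_rep):
        return self.mlp(torch.cat([text_rep, vis_rep], dim=-1))

logit = MatchProbe()(torch.randn(1, 768), torch.randn(1, 512))
```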
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling factors such as linguistic content and speaker identity.
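An assumed, minimal two-stream layout with a shared trunk and separate factor heads (all dimensions are placeholders, not the paper's network):

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    """Toy two-stream network: a shared trunk feeds separate heads, one for
    content and one for identity, so the two factors can be supervised (and
    disentangled) independently. Illustrative sketch only."""
    def __init__(self, in_dim=256, trunk_dim=128, emb_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, trunk_dim), nn.ReLU())
        self.content_head = nn.Linear(trunk_dim, emb_dim)
        self.identity_head = nn.Linear(trunk_dim, emb_dim)

    def forward(self, x):
        h = self.trunk(x)                             # shared low-level features
        return self.content_head(h), self.identity_head(h)

content, identity = TwoStream()(torch.randn(4, 256))
```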
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.