Understanding Co-speech Gestures in-the-wild
- URL: http://arxiv.org/abs/2503.22668v1
- Date: Fri, 28 Mar 2025 17:55:52 GMT
- Title: Understanding Co-speech Gestures in-the-wild
- Authors: Sindhu B Hegde, K R Prajwal, Taein Kwon, Andrew Zisserman
- Abstract summary: We introduce a new framework for co-speech gesture understanding in the wild. We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks.
- Score: 52.5993021523165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal
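The abstract only names the two training objectives; as a rough illustration of how a global phrase-level contrastive loss can be combined with a local gesture-word coupling term, here is a minimal, hypothetical sketch. The function names, temperature, batch-negative scheme, and loss weighting are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def global_contrastive(gesture_emb, phrase_emb, temperature=0.07):
    """Symmetric InfoNCE between phrase-level gesture and speech/text embeddings.

    gesture_emb, phrase_emb: (B, D) phrase-level vectors from the two branches.
    """
    g = F.normalize(gesture_emb, dim=-1)
    p = F.normalize(phrase_emb, dim=-1)
    logits = g @ p.t() / temperature                      # (B, B) similarities
    targets = torch.arange(g.size(0), device=g.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def local_coupling(gesture_frames, word_feats, temperature=0.07):
    """Toy local term: a sample's words should match its own gesture frames
    better than the frames of other samples in the batch.

    gesture_frames: (B, Tg, D) frame-level gesture features
    word_feats:     (B, Tw, D) word-level text/speech features
    """
    g = F.normalize(gesture_frames, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    # score words of sample i against gesture frames of sample j:
    # max over frames (best-matching frame per word), then mean over words
    scores = torch.einsum('iwd,jtd->ijwt', w, g).amax(-1).mean(-1)  # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores / temperature, targets)

def trimodal_loss(gesture_emb, phrase_emb, gesture_frames, word_feats,
                  local_weight=1.0):
    """Combined objective: global phrase contrastive + local gesture-word coupling."""
    return (global_contrastive(gesture_emb, phrase_emb)
            + local_weight * local_coupling(gesture_frames, word_feats))
```

In this toy version the global term aligns whole phrases across modalities, while the local term uses in-batch negatives at the word level; the actual coupling mechanism is described in the paper.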
Related papers
- Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues [56.36041287155606]
We investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE (a minimal quantisation sketch follows this entry). Results show that incorporating gestures enhances marker prediction accuracy across the three tasks.
arXiv Detail & Related papers (2025-03-05T13:10:07Z)
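The entry above mentions encoding 3D motion into discrete gesture tokens with a VQ-VAE. Below is a minimal sketch of the quantisation step only; the codebook size, feature dimension, and straight-through trick are generic assumptions, not that paper's configuration.

```python
import torch
import torch.nn as nn

class GestureQuantizer(nn.Module):
    """Minimal vector-quantisation step of a VQ-VAE for motion features."""

    def __init__(self, num_codes=512, code_dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                    # z: (B, T, code_dim) encoder output
        # distance from every frame feature to every codebook entry
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, codes)         # (B, T, num_codes)
        tokens = dist.argmin(dim=-1)         # (B, T) discrete gesture tokens
        z_q = self.codebook(tokens)          # quantised features
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return tokens, z_q
```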
- I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue [5.0332064683666005]
We introduce a multimodal reference resolution task centred on representational gestures. We simultaneously tackle the challenge of learning robust gesture embeddings. Our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
arXiv Detail & Related papers (2025-02-27T17:28:12Z)
- Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation [11.838249135550662]
Contextual Gesture is a framework that improves co-speech gesture video generation through three innovative components. Our experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications.
arXiv Detail & Related papers (2025-02-11T04:09:12Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (a hedged guidance-mixing sketch follows this entry).
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
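ConvoFusion's two guidance objectives are defined in that paper; the sketch below only illustrates the general classifier-free-guidance pattern of weighting separate conditioning modalities at sampling time. The per-modality weights and the `denoiser` interface are hypothetical.

```python
import torch

def guided_denoise(denoiser, x_t, t, audio_cond, text_cond,
                   w_audio=2.0, w_text=1.0):
    """Guidance-style mixing of two conditioning modalities for a diffusion step.

    denoiser(x_t, t, audio, text) -> predicted noise; passing None for a
    condition is assumed to mean "drop that modality" (unconditional pass).
    """
    eps_uncond = denoiser(x_t, t, None, None)
    eps_audio = denoiser(x_t, t, audio_cond, None)
    eps_text = denoiser(x_t, t, None, text_cond)
    # each weight scales how strongly its modality pulls the sample
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```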
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively (a minimal encoding sketch follows this entry).
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
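As a hedged sketch of the pipeline summarised above (look up images for a sentence, encode the text with a Transformer encoder and the images with a CNN, then fuse), the snippet below uses assumed layer sizes, a placeholder lookup table, and fusion by concatenation; none of these details come from the paper.

```python
import torch
import torch.nn as nn

# hypothetical topic -> retrieved image lookup table (placeholder tensors)
topic_images = {"sports": torch.randn(3, 64, 64), "weather": torch.randn(3, 64, 64)}

class TextImageEncoder(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model))

    def forward(self, token_ids, images):
        # token_ids: (B, L) word indices; images: (B, 3, H, W), one retrieved image
        # per sentence for simplicity
        text_feat = self.text_encoder(self.embed(token_ids)).mean(dim=1)  # (B, d)
        image_feat = self.image_encoder(images)                           # (B, d)
        return torch.cat([text_feat, image_feat], dim=-1)   # fused representation
```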
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens (a generic sketch follows this entry).
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
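VATLM's unified tokenizer and backbone are specified in that paper; the snippet below is only a generic sketch of a masked prediction loss over a unified token vocabulary, with an assumed prediction head and mask layout.

```python
import torch.nn.functional as F

def masked_prediction_loss(backbone_out, unified_tokens, mask, head):
    """Generic masked-token prediction over a unified token vocabulary.

    backbone_out:   (B, T, D) features from a shared multimodal backbone
    unified_tokens: (B, T)    target token ids shared across modalities
    mask:           (B, T)    boolean, True where the input was masked
    head:           e.g. nn.Linear(D, vocab_size) prediction head
    """
    logits = head(backbone_out[mask])            # only score masked positions
    return F.cross_entropy(logits, unified_tokens[mask])
```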
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z)
- Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning-based model takes both acoustic and semantic representations of speech as input and outputs gestures as a sequence of joint-angle rotations (a minimal interface sketch follows this entry).
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
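Below is a minimal, hypothetical sketch of the input/output interface described above: acoustic plus semantic speech features in, a sequence of joint-angle rotations out. The fusion layer, GRU decoder, and dimensions are assumptions, not Gesticulator's actual architecture.

```python
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    """Maps per-frame acoustic + semantic speech features to joint rotations."""

    def __init__(self, acoustic_dim=26, semantic_dim=768,
                 hidden_dim=256, num_joints=15):
        super().__init__()
        self.fuse = nn.Linear(acoustic_dim + semantic_dim, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.to_rotations = nn.Linear(hidden_dim, num_joints * 3)  # Euler angles

    def forward(self, acoustic, semantic):
        # acoustic: (B, T, acoustic_dim), semantic: (B, T, semantic_dim)
        x = torch.relu(self.fuse(torch.cat([acoustic, semantic], dim=-1)))
        h, _ = self.decoder(x)
        return self.to_rotations(h)      # (B, T, num_joints * 3) joint angles
```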
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.