Modeling Turn-Taking with Semantically Informed Gestures
- URL: http://arxiv.org/abs/2510.19350v1
- Date: Wed, 22 Oct 2025 08:17:54 GMT
- Title: Modeling Turn-Taking with Semantically Informed Gestures
- Authors: Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg
- Abstract summary: We introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations. We model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines.
- Score: 56.31369237947851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
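As a rough illustration of this setup (a sketch under stated assumptions, not the authors' released code), the snippet below shows one plausible Mixture-of-Experts fusion of pre-extracted text, audio, and gesture features for binary turn-shift prediction; the feature dimensions, expert layout, gating scheme, and the `MoETurnTakingPredictor` name are assumptions.

```python
# Illustrative sketch only: MoE-style fusion of text, audio, and gesture
# features for turn-shift vs. turn-hold prediction. Dimensions are placeholders.
import torch
import torch.nn as nn


class MoETurnTakingPredictor(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, gesture_dim=64, hidden=256):
        super().__init__()
        # One expert per modality, each projecting into a shared hidden space.
        self.experts = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU()),
            "audio": nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU()),
            "gesture": nn.Sequential(nn.Linear(gesture_dim, hidden), nn.ReLU()),
        })
        # Gating network: soft weights over the three experts,
        # conditioned on the concatenated raw features.
        self.gate = nn.Linear(text_dim + audio_dim + gesture_dim, 3)
        # Binary head: turn shift vs. turn hold at the current utterance boundary.
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, text_feat, audio_feat, gesture_feat):
        gate_logits = self.gate(torch.cat([text_feat, audio_feat, gesture_feat], dim=-1))
        weights = torch.softmax(gate_logits, dim=-1)            # (batch, 3)
        expert_outs = torch.stack([
            self.experts["text"](text_feat),
            self.experts["audio"](audio_feat),
            self.experts["gesture"](gesture_feat),
        ], dim=1)                                               # (batch, 3, hidden)
        fused = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)
        return self.classifier(fused)                           # (batch, 2) logits


# Example: one batch of four utterance-level feature vectors.
model = MoETurnTakingPredictor()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 64))
```

The soft gating lets the model lean on the gesture expert when semantic gesture cues are informative, which is one way the complementary role reported in the abstract could surface in practice.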
Related papers
- Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization [0.43348187554755113]
The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text.
Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns.
We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors.
arXiv Detail & Related papers (2025-09-11T19:14:57Z)
- SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning [0.6249768559720122]
We propose a novel approach for semantic grounding in co-speech gesture generation.
Our approach starts with learning the motion prior through a vector-quantized variational autoencoder.
Our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation.
arXiv Detail & Related papers (2025-07-25T15:10:15Z)
- Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues [56.36041287155606]
We investigate whether jointly modeling gestures, represented as human motion sequences, together with language can improve spoken discourse modeling.
To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE (a minimal tokenization sketch appears after this list).
Results show that incorporating gestures enhances marker prediction accuracy across the three tasks.
arXiv Detail & Related papers (2025-03-05T13:10:07Z)
- Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach that produces semantically rich gestures.
We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures.
We propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
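For the gesture-tokenization step mentioned in the Enhancing Spoken Discourse Modeling entry above, here is a minimal VQ-VAE-style sketch of mapping a 3D motion sequence to discrete gesture token ids; the pose dimensionality, codebook size, encoder, and the `GestureTokenizer` name are placeholder assumptions rather than that paper's implementation (a full VQ-VAE would also train a decoder and commitment losses, omitted here).

```python
# Illustrative sketch only: discretizing a pose sequence into gesture tokens
# via nearest-neighbor lookup in a learnable codebook. Sizes are placeholders.
import torch
import torch.nn as nn


class GestureTokenizer(nn.Module):
    def __init__(self, pose_dim=165, latent_dim=128, codebook_size=512):
        super().__init__()
        # Temporal encoder: downsamples the pose sequence into latent frames.
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
        )
        # Learnable codebook of discrete gesture embeddings.
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, motion):                  # motion: (batch, time, pose_dim)
        z = self.encoder(motion.transpose(1, 2)).transpose(1, 2)  # (batch, t', latent)
        b, t, d = z.shape
        # Nearest codebook entry per latent frame -> discrete gesture token ids.
        dists = torch.cdist(z.reshape(b * t, d), self.codebook.weight)  # (b*t, K)
        return dists.argmin(dim=-1).reshape(b, t)                       # (batch, t')


tokens = GestureTokenizer()(torch.randn(2, 64, 165))  # e.g. 64 pose frames -> 16 tokens each
```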