Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
- URL: http://arxiv.org/abs/2503.03474v1
- Date: Wed, 05 Mar 2025 13:10:07 GMT
- Title: Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
- Authors: Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg
- Abstract summary: We investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks.
- Score: 56.36041287155606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
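The pipeline described above reduces to three steps: quantize motion into discrete gesture tokens, project the token embeddings into the text embedding space, and condition a text-infilling model on them. The following is a minimal PyTorch sketch of those steps; the pose, latent, and text dimensions, the codebook size, and the simple MLP encoder are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GestureVQVAE(nn.Module):
    """Maps 3D pose frames to discrete gesture tokens via a learned codebook."""
    def __init__(self, pose_dim=165, latent_dim=256, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(          # reconstructs motion during VQ-VAE training
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, pose_dim),
        )

    def quantize(self, motion):                               # motion: (B, T, pose_dim)
        z = self.encoder(motion)                              # (B, T, latent_dim)
        flat = z.reshape(-1, z.size(-1))                      # (B*T, latent_dim)
        dists = torch.cdist(flat, self.codebook.weight)       # distance to each code
        token_ids = dists.argmin(dim=-1).view(z.size(0), z.size(1))  # (B, T)
        return token_ids, self.codebook(token_ids)            # ids + quantized latents

class GestureTextAligner(nn.Module):
    """Projects gesture token embeddings into the language model's text embedding space."""
    def __init__(self, latent_dim=256, text_dim=768):
        super().__init__()
        self.proj = nn.Linear(latent_dim, text_dim)

    def forward(self, gesture_latents):                       # (B, T, latent_dim)
        return self.proj(gesture_latents)                     # (B, T, text_dim)

# Usage: tokenize a motion clip and map it into the text embedding space, so the
# aligned gesture embeddings can accompany a text-infilling prompt in which a
# discourse connective (e.g. "however"), stance marker, or quantifier is masked.
vqvae, aligner = GestureVQVAE(), GestureTextAligner()
motion = torch.randn(1, 120, 165)                 # 120 frames of 3D pose features
token_ids, latents = vqvae.quantize(motion)
gesture_embeds = aligner(latents)
print(token_ids.shape, gesture_embeds.shape)      # torch.Size([1, 120]) torch.Size([1, 120, 768])
```

In the evaluation the abstract describes, these aligned gesture embeddings would presumably be supplied to the language model alongside the text token embeddings, and the model is scored on how accurately it fills in the masked discourse connective, stance marker, or quantifier.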
Related papers
- Understanding Co-speech Gestures in-the-wild [52.5993021523165]
We introduce a new framework for co-speech gesture understanding in the wild.
We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations.
We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks.
arXiv Detail & Related papers (2025-03-28T17:55:52Z) - Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules.
An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics.
We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z) - Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach [14.5696754689252]
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible.
We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations.
arXiv Detail & Related papers (2024-09-16T10:29:15Z) - Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z) - DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z) - Wave to Syntax: Probing spoken language models for syntax [16.643072915927313]
We focus on the encoding of syntax in several self-supervised and visually grounded models of spoken language.
We show that syntax is captured most prominently in the middle layers of the networks, and more explicitly within models with more parameters.
arXiv Detail & Related papers (2023-05-30T11:43:18Z) - ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z) - Passing a Non-verbal Turing Test: Evaluating Gesture Animations Generated from Speech [6.445605125467574]
In this paper, we propose a novel, data-driven technique for generating gestures directly from speech.
Our approach is based on the application of Generative Adversarial Neural Networks (GANs) to model the correlation rather than causation between speech and gestures.
For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures.
arXiv Detail & Related papers (2021-07-01T19:38:43Z) - Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
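The Gesticulator entry directly above gives the clearest statement of the speech-to-gesture interface: per-frame acoustic and semantic representations of speech in, a sequence of joint-angle rotations out. Below is a minimal illustrative sketch of that interface; the feature dimensions, joint count, and GRU backbone are assumptions for the sake of a runnable example, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeechToGestureModel(nn.Module):
    """Fuses acoustic and semantic speech features and predicts per-frame joint rotations."""
    def __init__(self, acoustic_dim=26, semantic_dim=300, hidden_dim=256, num_joints=15):
        super().__init__()
        # Fuse per-frame acoustic features (e.g. MFCCs) with frame-aligned word embeddings.
        self.fuse = nn.Linear(acoustic_dim + semantic_dim, hidden_dim)
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_joints * 3)   # 3 rotation values per joint

    def forward(self, acoustic, semantic):                 # both: (B, T, feat_dim)
        x = torch.relu(self.fuse(torch.cat([acoustic, semantic], dim=-1)))
        h, _ = self.temporal(x)
        return self.out(h)                                 # (B, T, num_joints * 3)

model = SpeechToGestureModel()
acoustic = torch.randn(2, 100, 26)                         # 100 frames of acoustic features
semantic = torch.randn(2, 100, 300)                        # word embeddings aligned to frames
joint_rotations = model(acoustic, semantic)
print(joint_rotations.shape)                               # torch.Size([2, 100, 45])
```

The per-frame rotation output is what makes the gestures retargetable, which is how such models can drive both virtual agents and humanoid robots, as the entry notes.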