Large language models in textual analysis for gesture selection
- URL: http://arxiv.org/abs/2310.13705v1
- Date: Wed, 4 Oct 2023 14:46:37 GMT
- Title: Large language models in textual analysis for gesture selection
- Authors: Laura B. Hensel, Nutchanon Yongsatianchot, Parisa Torshizi, Elena
Minucci, Stacy Marsella
- Abstract summary: We use large language models (LLMs) to show that these powerful models, trained on large amounts of data, can be adapted for gesture analysis and generation.
Specifically, we used ChatGPT as a tool for suggesting context-specific gestures that can realize designer intent based on minimal prompts.
- Score: 2.5169522472327404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gestures perform a variety of communicative functions that powerfully
influence human face-to-face interaction. How this communicative function is
achieved varies greatly between individuals and depends on the role of the
speaker and the context of the interaction. Approaches to automatic gesture
generation vary not only in the degree to which they rely on data-driven
techniques but also in the degree to which they can produce context- and
speaker-specific gestures. However, these approaches face two major challenges:
the first is obtaining sufficient training data appropriate for the context and
the goal of the application; the second is giving designers enough control to
realize their specific intent for the application. Here, we address these
challenges by using large language models (LLMs) to show that these powerful
models, trained on large amounts of data, can be adapted for gesture analysis and
generation. Specifically, we used ChatGPT as a tool for suggesting
context-specific gestures that can realize designer intent based on minimal
prompts. We also find that ChatGPT can suggest novel yet appropriate gestures
not present in the minimal training data. The use of LLMs is a promising avenue
for gesture generation that reduces the need for laborious annotations and has
the potential to flexibly and quickly adapt to different designer intents.
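As a rough illustration of the idea in the abstract, the sketch below prompts a chat LLM with a couple of annotated examples and asks it to suggest a gesture for a new utterance in a given context. It assumes the OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY in the environment; the model name, gesture labels, and prompt wording are illustrative placeholders, not the authors' actual prompts.
```python
# Minimal sketch of LLM-based gesture suggestion (illustrative only).
# Gesture labels and prompt wording are hypothetical, not the authors' exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = """\
Utterance: "The numbers keep going up and up."
Gesture: rising_hand_sweep  (metaphoric: upward trend)

Utterance: "On one hand we could wait, on the other hand we could act now."
Gesture: alternating_palms  (contrast between two options)
"""

def suggest_gesture(utterance: str, context: str) -> str:
    """Ask the model for a context-appropriate gesture label for one utterance."""
    prompt = (
        "You select co-speech gestures for a virtual agent.\n"
        f"Context: {context}\n\n"
        f"{FEW_SHOT}\n"
        f'Utterance: "{utterance}"\n'
        "Gesture:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

print(suggest_gesture("This idea is absolutely central to our plan.",
                      "A manager giving a persuasive presentation"))
```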
Related papers
- Understanding Co-speech Gestures in-the-wild [52.5993021523165]
We introduce a new framework for co-speech gesture understanding in the wild.
We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations.
We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks.
arXiv Detail & Related papers (2025-03-28T17:55:52Z)
- Large Language Models for Virtual Human Gesture Selection [0.3749861135832072]
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions.
We use the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures.
arXiv Detail & Related papers (2025-03-18T16:49:56Z)
- Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues [56.36041287155606]
We investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling.
To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE.
Results show that incorporating gestures enhances marker prediction accuracy across the three tasks.
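The summary above mentions encoding 3D human motion into discrete gesture tokens with a VQ-VAE. Below is a minimal sketch of just the quantization step (nearest-codebook lookup over per-frame features); the codebook size, feature dimension, and random inputs are illustrative assumptions, not the paper's configuration.
```python
# Minimal sketch of the vector-quantization step that turns continuous motion
# features into discrete gesture tokens. Shapes are illustrative assumptions.
import torch

num_codes, code_dim = 512, 64          # hypothetical codebook
codebook = torch.randn(num_codes, code_dim)

def quantize(motion_features: torch.Tensor) -> torch.Tensor:
    """Map each frame's feature vector (T, code_dim) to its nearest codebook index."""
    # Squared Euclidean distance between every frame and every code: (T, num_codes)
    dists = torch.cdist(motion_features, codebook) ** 2
    return dists.argmin(dim=-1)        # (T,) discrete gesture tokens

frames = torch.randn(120, code_dim)    # 120 frames of already-encoded motion features
tokens = quantize(frames)              # e.g. tensor([ 17, 402,  93, ...])
print(tokens.shape, tokens[:5])
```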
arXiv Detail & Related papers (2025-03-05T13:10:07Z)
- The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion [46.01825432018138]
We propose a novel framework that unifies verbal and non-verbal language using multimodal language models.
Our model achieves state-of-the-art performance on co-speech gesture generation.
We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications.
arXiv Detail & Related papers (2024-12-13T19:33:48Z)
- Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures.
We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures.
We propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence.
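A minimal sketch of the retrieval idea behind the RAG approach summarized above: rank a small database of pre-embedded co-speech gestures by cosine similarity to a query embedding, with a user-set guidance weight standing in for the paper's control paradigm. The database entries, embeddings, and weight value are made-up placeholders.
```python
# Toy retrieval over a database of pre-embedded co-speech gestures (illustrative only).
import numpy as np

gesture_db = {
    "beat_downward":   np.array([0.1, 0.9, 0.2]),
    "open_palm_offer": np.array([0.8, 0.1, 0.3]),
    "head_shake":      np.array([0.2, 0.3, 0.9]),
}

def retrieve(query_embedding: np.ndarray, top_k: int = 2):
    """Return the top_k gestures ranked by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((cos(query_embedding, e), name) for name, e in gesture_db.items()),
                    reverse=True)
    return scored[:top_k]

guidance_weight = 0.5  # how strongly retrieved motions steer generation (value illustrative)
print(retrieve(np.array([0.7, 0.2, 0.4])), "guidance:", guidance_weight)
```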
arXiv Detail & Related papers (2024-12-09T18:59:46Z)
- Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset [0.39462888523270856]
We propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes.
Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions.
arXiv Detail & Related papers (2024-11-21T14:01:42Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
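One plausible reading of the guidance objectives described above is classifier-free-guidance-style weighting of the conditioning modalities, sketched below; the denoiser, weights, and toy inputs are hypothetical and not ConvoFusion's actual formulation.
```python
# Hypothetical per-modality guidance weighting for a diffusion denoiser (illustrative only).
import torch

def guided_noise_estimate(denoiser, x_t, t, speech_cond, text_cond,
                          w_speech: float = 2.0, w_text: float = 1.0):
    """Combine unconditional and per-modality conditional predictions with user-set weights."""
    eps_uncond = denoiser(x_t, t, None, None)
    eps_speech = denoiser(x_t, t, speech_cond, None)
    eps_text   = denoiser(x_t, t, None, text_cond)
    return (eps_uncond
            + w_speech * (eps_speech - eps_uncond)
            + w_text   * (eps_text   - eps_uncond))

# Toy denoiser so the sketch runs end-to-end; a real model would be a trained network.
toy = lambda x, t, s, c: x * 0.9 + (0.0 if s is None else 0.05) + (0.0 if c is None else 0.02)
print(guided_noise_estimate(toy, torch.randn(4, 8), t=10,
                            speech_cond="spec", text_cond="words"))
```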
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons [16.52004713662265]
We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
arXiv Detail & Related papers (2023-09-13T16:07:25Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- Context-Aware Language Modeling for Goal-Oriented Dialogue Systems [84.65707332816353]
We formulate goal-oriented dialogue as a partially observed Markov decision process.
We derive a simple and effective method to finetune language models in a goal-aware way.
We evaluate our method on a practical flight-booking task using AirDialogue.
arXiv Detail & Related papers (2022-04-18T17:23:11Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
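A minimal sketch of fusing the trimodal context (text features, audio features, speaker identity) described above before decoding a pose sequence; the dimensions, GRU decoder, and random inputs are illustrative assumptions rather than the paper's architecture.
```python
# Toy trimodal fusion for gesture decoding (illustrative dimensions and decoder).
import torch
import torch.nn as nn

class TrimodalGestureDecoder(nn.Module):
    def __init__(self, text_dim=300, audio_dim=128, num_speakers=10,
                 speaker_dim=16, hidden=256, pose_dim=27):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, speaker_dim)
        self.rnn = nn.GRU(text_dim + audio_dim + speaker_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)   # pose per frame (e.g., joint coordinates)

    def forward(self, text_feats, audio_feats, speaker_ids):
        # text_feats: (B, T, text_dim), audio_feats: (B, T, audio_dim), speaker_ids: (B,)
        spk = self.speaker_emb(speaker_ids).unsqueeze(1).expand(-1, text_feats.size(1), -1)
        fused = torch.cat([text_feats, audio_feats, spk], dim=-1)
        hidden_states, _ = self.rnn(fused)
        return self.out(hidden_states)            # (B, T, pose_dim)

model = TrimodalGestureDecoder()
poses = model(torch.randn(2, 40, 300), torch.randn(2, 40, 128), torch.tensor([0, 3]))
print(poses.shape)  # torch.Size([2, 40, 27])
```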
arXiv Detail & Related papers (2020-09-04T11:42:45Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)