Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality
- URL: http://arxiv.org/abs/2406.12544v2
- Date: Wed, 14 Aug 2024 12:25:53 GMT
- Title: Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality
- Authors: Amelie Sophie Robrecht, Hendric Voss, Lisa Gottschalk, Stefan Kopp
- Abstract summary: This study investigates how different types of gestures influence perceived interaction quality and listener understanding.
Our model combines beat gestures generated by a learned speech-driven module with manually captured iconic gestures.
Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In human interaction, gestures serve various functions such as marking speech rhythm, highlighting key elements, and supplementing information. These gestures are also observed in explanatory contexts. However, the impact of gestures on explanations provided by virtual agents remains underexplored. A user study was carried out to investigate how different types of gestures influence perceived interaction quality and listener understanding. This study addresses the effect of gestures in explanation by developing an embodied virtual explainer integrating both beat gestures and iconic gestures to enhance its automatically generated verbal explanations. Our model combines beat gestures generated by a learned speech-driven synthesis module with manually captured iconic gestures, supporting the agent's verbal expressions about the board game Quarto! as an explanation scenario. Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding. Nonetheless, compared to prior research, the embodied agent significantly enhances understanding.
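To make the described pipeline concrete, below is a minimal, hypothetical Python sketch of how a speech-driven beat-gesture stream might be overlaid with manually captured iconic gesture clips that are keyed to concept words in the verbal explanation. All names (`GestureClip`, `plan_gestures`, the 30 fps assumption, the keyword lookup) are illustrative placeholders and not the authors' implementation.

```python
# Sketch only: merge a speech-driven beat-gesture stream with pre-captured
# iconic gesture clips triggered by concept words in the explanation.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GestureClip:
    """A motion segment: onset time in seconds plus per-frame joint rotations."""
    start: float
    frames: List[List[float]]


def plan_gestures(
    word_timings: Dict[str, float],        # spoken word -> onset time in the TTS audio
    beat_stream: List[GestureClip],        # output of a speech-driven beat-gesture model
    iconic_clips: Dict[str, GestureClip],  # concept keyword -> captured iconic clip
    fps: float = 30.0,                     # assumed capture frame rate
) -> List[GestureClip]:
    """Overlay iconic clips wherever a known concept word is spoken;
    keep beat gestures only in the remaining, unoccupied intervals."""
    timeline: List[GestureClip] = []
    occupied: List[Tuple[float, float]] = []

    # 1. Schedule an iconic gesture at the onset of each trigger word.
    for word, onset in word_timings.items():
        clip = iconic_clips.get(word.lower())
        if clip is not None:
            duration = len(clip.frames) / fps
            timeline.append(GestureClip(start=onset, frames=clip.frames))
            occupied.append((onset, onset + duration))

    # 2. Drop beat gestures that would collide with an iconic gesture.
    for beat in beat_stream:
        if not any(lo <= beat.start < hi for lo, hi in occupied):
            timeline.append(beat)

    return sorted(timeline, key=lambda clip: clip.start)
```

Giving iconic clips priority over beats in overlapping intervals reflects the intuition that content-bearing gestures should not be masked by rhythm-marking ones; whether the authors resolve overlaps this way is an assumption of this sketch.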
Related papers
- Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z)
- Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation [4.216085185442862]
In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors.
How can we learn meaningful gesture representations that account for gestures' variability and their relationship with speech?
This paper employs self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information.
arXiv Detail & Related papers (2024-08-31T08:53:18Z)
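As a concrete illustration of the contrastive-learning entry above, the following is a generic InfoNCE-style objective that pulls paired gesture (skeletal) and speech embeddings together and pushes unpaired ones apart. It is a sketch under the assumption of a PyTorch setup with precomputed embeddings; the paper's actual encoders and loss details are not reproduced here.

```python
# Generic InfoNCE-style contrastive objective between gesture (skeletal) and
# speech embeddings; illustrative only, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def contrastive_loss(gesture_emb: torch.Tensor,
                     speech_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Both inputs have shape (batch, dim); row i of each tensor is assumed
    to come from the same dialogue segment (a positive pair)."""
    g = F.normalize(gesture_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = g @ s.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)
    # Symmetric cross-entropy: gesture-to-speech and speech-to-gesture.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```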
- Nonverbal Interaction Detection [83.40522919429337]
This work addresses a new challenge of understanding human nonverbal interaction in social contexts.
We contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups.
Second, we establish a new task NVI-DET for nonverbal interaction detection, which is formalized as identifying triplets in the form <individual, group, interaction> from images.
Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs.
arXiv Detail & Related papers (2024-07-11T02:14:06Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- Iconic Gesture Semantics [87.00251241246136]
We argue that the perceptual classification of instances of visual communication requires a notion of meaning different from Frege/Montague frameworks.
Informational evaluation is spelled out as extended exemplification (extemplification) in terms of perceptual classification of a gesture's visual iconic model.
An iconic gesture semantics is introduced which covers the full range from gesture representations over model-theoretic evaluation to inferential interpretation in dynamic semantic frameworks.
arXiv Detail & Related papers (2024-04-29T13:58:03Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
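The guidance objectives mentioned in the ConvoFusion entry above can be illustrated with a generic, modality-weighted variant of classifier-free guidance for a diffusion denoiser. The `model(x_t, t, audio=..., text=...)` interface and the weights below are hypothetical assumptions, not the paper's formulation.

```python
# Illustrative modality-weighted guidance for a diffusion gesture model.
import torch


def guided_noise_estimate(model, x_t: torch.Tensor, t: torch.Tensor,
                          audio_cond, text_cond,
                          w_audio: float = 1.5,
                          w_text: float = 1.0) -> torch.Tensor:
    """Combine unconditional and per-modality conditional noise predictions.

    Larger w_audio / w_text strengthens the influence of that modality on the
    generated gesture sequence.
    """
    eps_uncond = model(x_t, t, audio=None, text=None)
    eps_audio = model(x_t, t, audio=audio_cond, text=None)
    eps_text = model(x_t, t, audio=None, text=text_cond)
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```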
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- Multimodal analysis of the predictability of hand-gesture properties [10.332200713176768]
Embodied conversational agents benefit from being able to accompany their speech with gestures.
We investigate which gesture properties can be predicted from speech text and/or audio using contemporary deep learning.
arXiv Detail & Related papers (2021-08-12T14:16:00Z)
- Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
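The Gesticulator entry above describes a deep model that maps acoustic and semantic speech representations to gestures expressed as joint-angle rotation sequences. Below is a minimal sketch of such an input/output interface; the simple GRU architecture, the layer sizes, and the feature dimensions are illustrative assumptions, not the published model.

```python
# Sketch of a speech-to-gesture interface: acoustic + semantic features in,
# per-frame joint-angle rotations out. Architecture and sizes are assumptions.
import torch
import torch.nn as nn


class SpeechToGesture(nn.Module):
    def __init__(self, acoustic_dim: int = 26, semantic_dim: int = 300,
                 hidden_dim: int = 256, num_joints: int = 15):
        super().__init__()
        self.encoder = nn.GRU(acoustic_dim + semantic_dim, hidden_dim,
                              batch_first=True)
        # Three rotation angles (e.g. Euler) per joint, per frame.
        self.decoder = nn.Linear(hidden_dim, num_joints * 3)

    def forward(self, acoustic: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        """acoustic: (batch, frames, acoustic_dim); semantic: (batch, frames, semantic_dim).

        Returns joint-angle rotations of shape (batch, frames, num_joints * 3),
        which could drive a virtual agent or humanoid robot rig.
        """
        features = torch.cat([acoustic, semantic], dim=-1)
        hidden, _ = self.encoder(features)
        return self.decoder(hidden)
```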