GestureGPT: Toward Zero-shot Interactive Gesture Understanding and Grounding with Large Language Model Agents
- URL: http://arxiv.org/abs/2310.12821v4
- Date: Fri, 21 Jun 2024 10:12:41 GMT
- Title: GestureGPT: Toward Zero-shot Interactive Gesture Understanding and Grounding with Large Language Model Agents
- Authors: Xin Zeng, Xiaoyu Wang, Tengxiang Zhang, Chun Yu, Shengdong Zhao, Yiqiang Chen
- Abstract summary: We introduce GestureGPT, a free-form hand gesture understanding framework that does not require users to learn, demonstrate, or associate gestures.
Our framework leverages the large language model's astute common sense and strong inference ability to understand a spontaneously performed gesture.
We validated our conceptual framework in two real-world scenarios: smart home control and online video streaming.
- Score: 35.48323584634582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current gesture interfaces typically require users to learn and perform gestures from a predefined set, which leads to a less natural experience. Interfaces supporting user-defined gestures eliminate the learning process, but users still need to demonstrate each gesture and associate it with a specific system function themselves. We introduce GestureGPT, a free-form hand gesture understanding framework that does not require users to learn, demonstrate, or associate gestures. Our framework leverages the large language model's (LLM) astute common sense and strong inference ability to understand a spontaneously performed gesture from its natural language descriptions, and automatically maps it to a function provided by the interface. More specifically, our triple-agent framework involves a Gesture Description Agent that automatically segments and formulates natural language descriptions of hand poses and movements based on hand landmark coordinates. The description is deciphered by a Gesture Inference Agent through self-reasoning and querying about the interaction context (e.g., interaction history, gaze data), which a Context Management Agent organizes and provides. Following iterative exchanges, the Gesture Inference Agent discerns user intent, grounding it to an interactive function. We validated our conceptual framework in two real-world scenarios: smart home control and online video streaming. The average zero-shot Top-5 grounding accuracies are 83.59% for smart home tasks and 73.44% for video streaming. We also provide an extensive discussion of our framework, covering model selection rationale, generated description quality, and generalizability.
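The abstract describes the triple-agent loop only in prose; the following is a minimal Python sketch of how such a loop might be wired. The `llm` callable stands in for any chat-completion API, and the landmark heuristic, prompts, context fields, and function names are illustrative assumptions, not the authors' implementation.
```python
from typing import Callable

def describe_gesture(landmarks: list[tuple[float, float, float]]) -> str:
    """Gesture Description Agent: turn 21 hand-landmark coordinates
    (MediaPipe-style ordering assumed) into a natural-language description."""
    wrist = landmarks[0]
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, wrist)) ** 0.5
    # Toy heuristic: a fingertip counts as "extended" if it lies farther from
    # the wrist than its PIP joint (tips 8/12/16/20 vs. PIPs 6/10/14/18).
    tips, pips = (8, 12, 16, 20), (6, 10, 14, 18)
    extended = sum(dist(landmarks[t]) > dist(landmarks[p]) for t, p in zip(tips, pips))
    return f"A hand pose with {extended} of four fingers extended."

def manage_context(query: str) -> str:
    """Context Management Agent: answer the inference agent's queries from
    whatever interaction state is available (stubbed here)."""
    context = {"gaze": "user is looking at the ceiling light",
               "history": "last command was 'lights on', 2 minutes ago"}
    hits = [f"{k}: {v}" for k, v in context.items() if k in query.lower()]
    return "; ".join(hits) or "no matching context"

def infer_intent(description: str, functions: list[str],
                 llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Gesture Inference Agent: self-reason, query context iteratively,
    then ground the gesture to one interface function."""
    notes: list[str] = []
    for _ in range(max_rounds):
        question = llm(f"Gesture: {description}\nNotes so far: {notes}\n"
                       "Ask ONE question about gaze or history, or reply DONE.")
        if "DONE" in question:
            break
        notes.append(f"Q: {question} A: {manage_context(question)}")
    return llm(f"Gesture: {description}\nNotes: {notes}\n"
               f"Choose the most likely target from {functions}, ranked.")
```
A real deployment would replace `llm` with an actual model call, feed per-frame MediaPipe-style landmarks, and supply the interface's true function list in place of these stubs.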
Related papers
- Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality [0.0]
This study investigates how different types of gestures influence perceived interaction quality and listener understanding.
Our model combines beat gestures generated by a learned speech-driven module with manually captured iconic gestures.
Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding.
arXiv Detail & Related papers (2024-06-18T12:23:00Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
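The ConvoFusion summary above attributes controllability to guidance objectives that let users modulate the impact of different conditioning modalities. Below is a minimal sketch of one standard way such per-modality modulation is realized in diffusion samplers, composable classifier-free guidance; the denoiser stub and per-modality weights are assumptions, not the paper's exact objectives.
```python
import torch

def guided_noise(model, x_t, t, conds: dict, weights: dict):
    """Compose per-modality guidance in the classifier-free-guidance style:
    eps = eps_uncond + sum_m w_m * (eps_cond_m - eps_uncond).
    `model(x_t, t, cond)` is a stand-in denoiser; cond=None means unconditional.
    """
    eps_uncond = model(x_t, t, None)
    eps = eps_uncond.clone()
    for name, cond in conds.items():
        # weights[name] lets a user dial each modality (e.g. speech vs. text) up or down
        eps = eps + weights[name] * (model(x_t, t, cond) - eps_uncond)
    return eps
```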
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
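The CoCo-Agent summary names conditional action prediction (CAP) without spelling it out; below is a hypothetical two-stage sketch in which an action type is chosen first and its arguments are predicted conditioned on that choice. All names, prompts, and the action vocabulary are illustrative, not the paper's design.
```python
from dataclasses import dataclass
from typing import Callable

ACTION_TYPES = ["click", "type_text", "scroll", "press_back"]

@dataclass
class Action:
    kind: str
    args: dict

def predict_action(llm: Callable[[str], str], screen_desc: str, goal: str) -> Action:
    # Stage 1: condition only on the screen and goal to pick an action type.
    kind = llm(f"Screen: {screen_desc}\nGoal: {goal}\n"
               f"Choose one action type from {ACTION_TYPES}.").strip()
    # Stage 2: condition on the chosen type to predict its arguments.
    args_spec = {"click": "target element id", "type_text": "text to type",
                 "scroll": "direction", "press_back": "none"}.get(kind, "arguments")
    value = llm(f"Screen: {screen_desc}\nGoal: {goal}\nAction: {kind}\n"
                f"Give the {args_spec}.")
    return Action(kind=kind, args={"value": value})
```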
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
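The Dynalang summary describes a multimodal world model that predicts future text and image representations; here is a toy PyTorch sketch of that idea, with a recurrent latent and per-modality prediction heads. The sizes, the fusion layer, and the squared-error targets are assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Fuse image and text features into a latent state and predict the
    *next* step's representations, treating language as a prediction target."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)
        self.dynamics = nn.GRUCell(d, d)
        self.img_head = nn.Linear(d, d)
        self.txt_head = nn.Linear(d, d)

    def step(self, h, img_feat, txt_feat):
        z = torch.tanh(self.fuse(torch.cat([img_feat, txt_feat], dim=-1)))
        h = self.dynamics(z, h)
        return h, self.img_head(h), self.txt_head(h)

def loss_step(model, h, img_t, txt_t, img_next, txt_next):
    # One training step: roll the latent forward, penalize prediction error
    # on both modalities' next representations.
    h, img_pred, txt_pred = model.step(h, img_t, txt_t)
    loss = ((img_pred - img_next) ** 2).mean() + ((txt_pred - txt_next) ** 2).mean()
    return h, loss
```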
- EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods when both videos and texts are used as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z)
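The EC^2 summary benchmarks against contrastive video-language pre-training; for reference, a minimal InfoNCE sketch of that baseline family follows. Note that EC2 itself replaces this objective with emergent communication, which this sketch does not implement.
```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings:
    matched pairs are positives, all other pairings in the batch are negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau                      # (B, B) similarity matrix
    labels = torch.arange(v.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```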
- Snapture -- A Novel Neural Architecture for Combined Static and Dynamic Hand Gesture Recognition [19.320551882950706]
We propose a novel hybrid hand gesture recognition system.
Our architecture enables learning both static and dynamic gestures.
Our work contributes both to gesture recognition research and machine learning applications for non-verbal communication with robots.
arXiv Detail & Related papers (2022-05-28T11:12:38Z)
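The Snapture summary describes learning both static and dynamic gestures; a toy two-branch PyTorch sketch of that idea follows, with one branch encoding a static hand-pose snapshot and one encoding the movement trajectory before a fused classifier. Layer sizes and the fusion scheme are illustrative, not the paper's architecture.
```python
import torch
import torch.nn as nn

class HybridGestureNet(nn.Module):
    """Static branch: flattened hand pose (21 landmarks x 3 = 63 values).
    Dynamic branch: 2D trajectory over time, encoded by a GRU.
    The two encodings are concatenated and classified jointly."""
    def __init__(self, pose_dim=63, traj_dim=2, n_classes=10, d=64):
        super().__init__()
        self.pose_branch = nn.Sequential(nn.Linear(pose_dim, d), nn.ReLU())
        self.traj_branch = nn.GRU(traj_dim, d, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, pose, traj):              # pose: (B, 63), traj: (B, T, 2)
        _, h = self.traj_branch(traj)           # h: (1, B, d)
        fused = torch.cat([self.pose_branch(pose), h[-1]], dim=-1)
        return self.head(fused)                 # (B, n_classes) logits
```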
- A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents [0.0]
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users.
Recent end-to-end gesture generation methods have not been evaluated in real-time interaction with users.
We present a proof-of-concept framework which is intended to facilitate evaluation of modern gesture generation models in interaction.
arXiv Detail & Related papers (2021-02-24T14:31:21Z)
- Interactive Teaching for Conversational AI [2.5259192787433706]
Current conversational AI systems aim to understand a set of pre-designed requests and execute related actions.
Motivated by how children learn their first language by interacting with adults, this paper describes a new Teachable AI system.
It can learn new language nuggets, called concepts, directly from end users through live interactive teaching sessions.
arXiv Detail & Related papers (2020-12-02T04:08:49Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
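The COBE summary trains a visual detector to predict a contextualized word embedding of the object and its narration; the sketch below shows one plausible form of that training signal, regressing detection features onto frozen contextual token embeddings with a cosine objective. The dimensions and the exact loss are assumptions, not the paper's specification.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Project per-detection ROI features into the language model's embedding
    space, so "apple" in "slice the apple" and "pick an apple" can have
    distinct, context-dependent targets."""
    def __init__(self, feat_dim=256, lm_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, lm_dim)

    def forward(self, roi_features):                 # (N, feat_dim) per detection
        return F.normalize(self.proj(roi_features), dim=-1)

def cobe_style_loss(pred_emb, target_ctx_emb):
    # Pull each detection toward its narration token's contextual embedding.
    target = F.normalize(target_ctx_emb, dim=-1)
    return (1 - F.cosine_similarity(pred_emb, target, dim=-1)).mean()
```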
- Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
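The Gesticulator summary specifies the interface precisely: acoustic plus semantic speech representations in, a sequence of joint-angle rotations out. A minimal PyTorch sketch of a model with that interface follows; the per-frame MLP and all sizes are illustrative, not the paper's architecture.
```python
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    """Map per-frame acoustic features and text embeddings to joint-angle
    rotations, one pose per speech frame."""
    def __init__(self, audio_dim=26, text_dim=300, n_joints=15, d=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, d), nn.ReLU(),
            nn.Linear(d, n_joints * 3))          # 3 rotation angles per joint

    def forward(self, audio_feat, text_emb):     # (B, T, audio_dim), (B, T, text_dim)
        out = self.net(torch.cat([audio_feat, text_emb], dim=-1))
        return out.view(*out.shape[:-1], -1, 3)  # (B, T, n_joints, 3)
```
The same output format applies to virtual agents and humanoid robots alike, since both consume joint rotations; only the retargeting downstream differs.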