Related papers: Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents

URL: http://arxiv.org/abs/2408.04127v1
Date: Wed, 7 Aug 2024 23:23:50 GMT
Title: Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
Authors: Anna Deichler, Simon Alexanderson, Jonas Beskow
Abstract summary: This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis.
Score: 17.299991009921307
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Related papers

Audio-driven Gesture Generation via Deviation Feature in the Latent Space [2.8952735126314733]
We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
arXiv Detail & Related papers (2025-03-27T15:37:16Z)
Large Language Models for Virtual Human Gesture Selection [0.3749861135832072]
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. We use the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures.
arXiv Detail & Related papers (2025-03-18T16:49:56Z)
Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures. We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures. We propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z)
Autonomous Character-Scene Interaction Synthesis from Text Instruction [45.255215402142596]
We introduce a framework for synthesizing multi-stage scene-aware interaction motions directly from a single text instruction and goal location. Our approach employs an auto-regressive diffusion model to synthesize the next motion segment, along with an autonomous scheduler predicting the transition for each action stage. We present a comprehensive motion-captured dataset comprising 16 hours of motion sequences in 120 indoor scenes covering 40 types of motions, each annotated with precise language descriptions.
arXiv Detail & Related papers (2024-10-04T06:58:45Z)
Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation. Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize. We present InstructMotion, which incorporates the trial-and-error paradigm in reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit. Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model. To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
THOR: Text to Human-Object Interaction Diffusion via Relation Intervention [51.02435289160616]
We propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR). In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. We construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset.
arXiv Detail & Related papers (2024-03-17T13:17:25Z)
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes. Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene. Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences. Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
GestureGPT: Toward Zero-Shot Free-Form Hand Gesture Understanding with Large Language Model Agents [35.48323584634582]
We introduce GestureGPT, a free-form hand gesture understanding framework that mimics human gesture understanding procedures. Our framework leverages multiple Large Language Model agents to manage and synthesize gesture and context information. We validated our framework offline under two real-world scenarios: smart home control and online video streaming.
arXiv Detail & Related papers (2023-10-19T15:17:34Z)
AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis [0.0]
We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures.
arXiv Detail & Related papers (2023-05-02T07:59:38Z)
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation [11.948557523215316]
The automatic generation of such co-speech gestures is a long-standing problem in computer animation. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
arXiv Detail & Related papers (2023-01-13T00:20:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.