Grounded Gesture Generation: Language, Motion, and Space
- URL: http://arxiv.org/abs/2507.04522v1
- Date: Sun, 06 Jul 2025 20:19:34 GMT
- Title: Grounded Gesture Generation: Language, Motion, and Space
- Authors: Anna Deichler, Jim O'Regan, Teo Guichoux, David Johansson, Jonas Beskow,
- Abstract summary: We introduce a multimodal dataset and framework for grounded gesture generation.<n>We provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format.<n>Our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction.
- Score: 3.4973270688542626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction. Project page: https://groundedgestures.github.io/
Related papers
- HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HO SIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception.<n>Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention.<n>This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z) - UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes [26.71077287710599]
We propose UniHM, a unified motion language model that leverages diffusion-based generation for scene-aware human motion.<n>UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes.<n>Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of
arXiv Detail & Related papers (2025-05-19T07:02:12Z) - Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy [30.43930233035367]
We introduce the first unified physics-based HO framework that leverages Vision-Language Models (VLMs)<n>We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-temporal bipartite motion representation that automatically constructs goal states and reward functions for reinforcement learning.<n>To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans.
arXiv Detail & Related papers (2025-03-24T05:18:04Z) - HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures [8.50717565369252]
HoleGest is a novel neural network framework for automatic generation of high-quality, expressive co-speech gestures.<n>Our system learns a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements.<n>Our model achieves a level of realism close to the ground truth, providing an immersive user experience.
arXiv Detail & Related papers (2025-03-17T14:42:31Z) - Towards Open Domain Text-Driven Synthesis of Multi-Person Motions [36.737740727883924]
We curate human pose and motion datasets by estimating pose information from large-scale image and video datasets.
Our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
arXiv Detail & Related papers (2024-05-28T18:00:06Z) - Generating Human Motion in 3D Scenes from Text Descriptions [60.04976442328767]
This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions.
We propose a new approach that decomposes the complex problem into two more manageable sub-problems.
For language grounding of the target object, we leverage the power of large language models; for motion generation, we design an object-centric scene representation.
arXiv Detail & Related papers (2024-05-13T14:30:12Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z) - Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z) - GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction [61.833152949826946]
We propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR.
GraMMaR learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence.
It is trained to explicitly promote consistency between the motion and distance change towards the ground.
arXiv Detail & Related papers (2023-06-29T07:22:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.