Generating Human Interaction Motions in Scenes with Text Control
- URL: http://arxiv.org/abs/2404.10685v1
- Date: Tue, 16 Apr 2024 16:04:38 GMT
- Title: Generating Human Interaction Motions in Scenes with Text Control
- Authors: Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe
- Abstract summary: We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
- Score: 66.74298145999909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at https://research.nvidia.com/labs/toronto-ai/tesmo.
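The abstract describes a two-stage training recipe: a scene-agnostic text-to-motion diffusion model pre-trained with goal-reaching conditioning, then a scene-aware component fine-tuned on data augmented with ground-plane and object-shape information. The PyTorch sketch below illustrates one way such a pipeline could be wired together; the module structure, feature dimensions, noise schedule, and toy tensors are illustrative assumptions, not TeSMo's released implementation.
```python
# Hypothetical two-stage set-up: (1) pre-train a scene-agnostic, text- and
# goal-conditioned motion denoiser; (2) fine-tune with an added scene-aware
# branch fed by scene features (e.g. ground plane / object occupancy).
# All names, dimensions, and toy data are assumptions for illustration.
import math
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Predicts the clean motion x0 from a noised motion x_t, conditioned on
    a text embedding, a 2D goal location, and optionally a scene feature."""
    def __init__(self, motion_dim=66, text_dim=512, scene_dim=128, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(text_dim + 2, hidden)   # text + (x, y) goal
        self.scene_proj = nn.Linear(scene_dim, hidden)     # scene-aware branch
        self.t_embed = nn.Embedding(1000, hidden)          # diffusion timestep
        self.backbone = nn.GRU(motion_dim + hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, t, text_emb, goal_xy, scene_feat=None):
        cond = self.cond_proj(torch.cat([text_emb, goal_xy], dim=-1)) + self.t_embed(t)
        if scene_feat is not None:                         # stage 2 only
            cond = cond + self.scene_proj(scene_feat)
        cond = cond.unsqueeze(1).expand(-1, x_t.shape[1], -1)
        h, _ = self.backbone(torch.cat([x_t, cond], dim=-1))
        return self.head(h)

def training_step(model, x0, text_emb, goal_xy, scene_feat=None, T=1000):
    """One simplified DDPM-style x0-prediction step with a cosine schedule."""
    t = torch.randint(0, T, (x0.shape[0],))
    alpha_bar = torch.cos(0.5 * math.pi * t.float() / T).view(-1, 1, 1) ** 2
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * torch.randn_like(x0)
    return nn.functional.mse_loss(model(x_t, t, text_emb, goal_xy, scene_feat), x0)

# Toy batch: 4 clips of 60 frames with 66-D poses, plus text and goal stand-ins.
model = MotionDenoiser()
x0, text, goal = torch.randn(4, 60, 66), torch.randn(4, 512), torch.randn(4, 2)

# Stage 1: scene-agnostic pre-training on large motion-capture data.
training_step(model, x0, text, goal).backward()

# Stage 2: fine-tune with scene features from motions embedded in 3D scenes.
training_step(model, x0, text, goal, scene_feat=torch.randn(4, 128)).backward()
```
In this sketch the scene branch is simply added to the conditioning signal during fine-tuning; how the scene-aware component is actually attached, and which pre-trained weights are kept frozen, are design choices the abstract does not spell out.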
Related papers
- KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
arXiv Detail & Related papers (2024-11-23T06:50:11Z)
- Physics-based Scene Layout Generation from Human Motion [21.939444709132395]
We present a physics-based approach that simultaneously optimizes a scene layout generator and simulates a moving human in a physics simulator.
We use reinforcement learning to perform a dual-optimization of both the character motion imitation controller and the scene layout generator.
We evaluate our method using motions from SAMP and PROX, and demonstrate physically plausible scene layout reconstruction compared with the previous kinematics-based method.
arXiv Detail & Related papers (2024-05-21T02:36:37Z)
- Generating Human Motion in 3D Scenes from Text Descriptions [60.04976442328767]
This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions.
We propose a new approach that decomposes the complex problem into two more manageable sub-problems.
For language grounding of the target object, we leverage the power of large language models; for motion generation, we design an object-centric scene representation.
arXiv Detail & Related papers (2024-05-13T14:30:12Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer [27.278989809466392]
We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene.
We leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors.
arXiv Detail & Related papers (2023-11-28T18:03:27Z)
- Synthesizing Diverse Human Motions in 3D Indoor Scenes [16.948649870341782]
We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner.
Existing approaches rely on training sequences that contain captured human motions and the 3D scenes they interact with.
We propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously.
arXiv Detail & Related papers (2023-05-21T09:22:24Z)
- CIRCLE: Capture In Rich Contextual Environments [69.97976304918149]
We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world.
We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across nine scenes.
We use this dataset to train a model that generates human motion conditioned on scene information.
arXiv Detail & Related papers (2023-03-31T09:18:12Z)
- Synthesizing Physical Character-Scene Interactions [64.26035523518846]
For realistic character animation, it is necessary to synthesize interactions between virtual characters and their surroundings.
We present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters.
Our approach takes physics-based character motion generation a step closer to broad applicability.
arXiv Detail & Related papers (2023-02-02T05:21:32Z)
- IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions [69.95820880360345]
We present the first framework to synthesize the full-body motion of virtual human characters with 3D objects placed within their reach.
Our system takes as input textual instructions specifying the objects and the associated intentions of the virtual characters.
We show that our synthesized full-body motions appear more realistic to study participants in more than 80% of scenarios.
arXiv Detail & Related papers (2022-12-14T23:59:24Z)