Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning
- URL: http://arxiv.org/abs/2303.09410v1
- Date: Thu, 16 Mar 2023 15:44:15 GMT
- Title: Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning
- Authors: Haibiao Xuan, Xiongzheng Li, Jinsong Zhang, Hongwen Zhang, Yebin Liu and Kun Li
- Abstract summary: We focus on naturally and controllably generating realistic and diverse HSIs from textual descriptions.
We propose Narrator, a novel relationship reasoning-based generative approach.
Our experiments and perceptual studies show that Narrator can controllably generate diverse interactions and significantly outperform existing works.
- Score: 34.00107506891627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Naturally controllable human-scene interaction (HSI) generation has an
important role in various fields, such as VR/AR content creation and
human-centered AI. However, existing methods offer unnatural and unintuitive
controllability, which heavily limits their application in practice.
Therefore, we focus on a challenging task of naturally and controllably
generating realistic and diverse HSIs from textual descriptions. From the
perspective of human cognition, an ideal generative model should correctly reason about spatial
relationships and interactive actions. To that end, we propose Narrator, a
novel relationship reasoning-based generative approach using a conditional
variational autoencoder for naturally controllable generation given a 3D scene
and a textual description. In addition, we model the global and local spatial
relationships in the 3D scene and the textual description, respectively, based on a
scene graph, and introduce a part-level action mechanism to represent
interactions as atomic body part states. In particular, benefiting from our
relationship reasoning, we further propose a simple yet effective multi-human
generation strategy, which is the first exploration of controllable
multi-human scene interaction generation. Our extensive experiments and
perceptual studies show that Narrator can controllably generate diverse
interactions and significantly outperform existing works. The code and dataset
will be available for research purposes.
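
To make the conditioning described above concrete, here is a minimal, hypothetical PyTorch sketch of a scene- and text-conditioned CVAE in the spirit of Narrator. It is not the authors' released code: the feature dimensions are made up, and the scene-graph relationship features, text embedding, and part-level action states are assumed to be precomputed fixed-size vectors.

```python
# Hypothetical sketch of a scene- and text-conditioned CVAE; all sizes and inputs
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    def __init__(self, body_dim=75, scene_dim=256, text_dim=512,
                 part_state_dim=32, latent_dim=64, hidden=512):
        super().__init__()
        cond_dim = scene_dim + text_dim + part_state_dim
        # Encoder q(z | body, condition) -> mean and log-variance of the latent.
        self.encoder = nn.Sequential(
            nn.Linear(body_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        # Decoder p(body | z, condition) regresses body parameters
        # (e.g. global translation/orientation plus pose).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, body_dim))
        self.latent_dim = latent_dim

    def forward(self, body, scene_feat, text_feat, part_states):
        cond = torch.cat([scene_feat, text_feat, part_states], dim=-1)
        mu, logvar = self.encoder(torch.cat([body, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

    @torch.no_grad()
    def sample(self, scene_feat, text_feat, part_states):
        cond = torch.cat([scene_feat, text_feat, part_states], dim=-1)
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))
```

At inference only `sample` would be used; reusing the same conditioning with different latent draws yields diverse interactions for one description, and a multi-human strategy could, in principle, sample each person from their own text clause while sharing the scene features.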
Related papers
- Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [28.825612240280822]
We propose a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control.
Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions.
We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation.
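As a rough illustration of the two-stage recipe above (language-motion pre-alignment, then parameter-efficient video-conditioned fine-tuning), here is a hypothetical PyTorch skeleton; the encoders, adapter, and dummy batches are stand-ins, not the paper's components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoders: 512-d text features and 263-d motion features per clip.
text_enc, motion_enc = nn.Linear(512, 256), nn.Linear(263, 256)
video_adapter = nn.Linear(768, 256)   # small, parameter-efficient adapter

def align_loss(a, b):
    # Symmetric InfoNCE-style loss pulling paired embeddings together.
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    target = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target))

# Stage 1: language-motion pre-alignment on text-paired (non-egocentric) motion data.
text_feat, motion_feat = torch.randn(8, 512), torch.randn(8, 263)   # dummy batch
loss_stage1 = align_loss(text_enc(text_feat), motion_enc(motion_feat))

# Stage 2: freeze the aligned backbone; only the video adapter is trained, so that
# egocentric visual context can steer context-aware motion generation.
for p in list(text_enc.parameters()) + list(motion_enc.parameters()):
    p.requires_grad_(False)
video_feat = torch.randn(8, 768)                                     # dummy batch
loss_stage2 = align_loss(video_adapter(video_feat), motion_enc(motion_feat))
```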
arXiv Detail & Related papers (2025-02-20T18:17:11Z)
- ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments [0.13654846342364302]
We propose ViRAC, which exploits the common-sense knowledge and reasoning capabilities of large-scale models.
ViRAC produces more natural and context-aware head rotations than recent state-of-the-art techniques.
arXiv Detail & Related papers (2025-02-14T09:46:43Z)
- ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation [17.438484695828276]
We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering.
Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and use differentiable rendering to reconstruct human-scene interactions.
We evaluate ZeroHSI on a curated dataset of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.
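The core idea above, recovering human-scene interaction by optimizing pose so that a differentiable rendering matches frames from a pretrained video generator, might look roughly like the following toy sketch; `render`, the dummy frames, and the loss weights are all hypothetical.

```python
import torch

def render(pose, cam):
    # Placeholder differentiable "renderer": maps pose parameters to an image.
    return torch.sigmoid(pose.sum() * torch.ones(3, 64, 64) + cam)

generated_frames = [torch.rand(3, 64, 64) for _ in range(16)]  # dummy "generated video"
cam = torch.zeros(3, 64, 64)

poses = torch.zeros(16, 72, requires_grad=True)   # per-frame body pose to recover
opt = torch.optim.Adam([poses], lr=0.05)
for step in range(100):
    opt.zero_grad()
    # Photometric loss against the generated frames plus a temporal smoothness prior.
    loss = sum(((render(p, cam) - f) ** 2).mean()
               for p, f in zip(poses, generated_frames))
    loss = loss + 0.1 * (poses[1:] - poses[:-1]).pow(2).mean()
    loss.backward()
    opt.step()
```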
arXiv Detail & Related papers (2024-12-24T18:55:38Z)
- OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains [66.62502882481373]
Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions.
We propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions.
Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism which includes semantic adjustment and geometry deformation to improve robustness.
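A hedged sketch of what a contact-guided interaction refiner could look like: given an initial pose and predicted contact areas, pull the contact vertices toward the object surface while staying close to the initialization. Function names, shapes, and weights are illustrative, not the paper's implementation.

```python
import torch

def refine_contacts(body_verts, contact_mask, object_points, steps=50, lr=0.01):
    """body_verts: (V, 3) initial body vertices; contact_mask: (V,) bool of
    predicted contact areas; object_points: (P, 3) sampled object surface points."""
    offset = torch.zeros_like(body_verts, requires_grad=True)
    opt = torch.optim.Adam([offset], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        verts = body_verts + offset
        # Distance from each predicted-contact vertex to its nearest object point.
        d = torch.cdist(verts[contact_mask], object_points).min(dim=1).values
        loss = d.pow(2).mean() + 1e-2 * offset.pow(2).mean()   # stay near the init
        loss.backward()
        opt.step()
    return (body_verts + offset).detach()

refined = refine_contacts(torch.randn(100, 3), torch.rand(100) > 0.8, torch.randn(500, 3))
```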
arXiv Detail & Related papers (2024-11-27T10:13:35Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
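A minimal sketch, assuming a simple MLP denoiser, of the two-stage recipe above: pre-train a scene-agnostic text-to-motion denoiser, then fine-tune it with an extra scene embedding in the conditioning. Shapes and module names are made up.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim=263, text_dim=512, scene_dim=128, hidden=512):
        super().__init__()
        self.scene_dim = scene_dim
        self.net = nn.Sequential(
            nn.Linear(motion_dim + 1 + text_dim + scene_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, motion_dim))

    def forward(self, noisy_motion, t, text_feat, scene_feat=None):
        if scene_feat is None:   # stage 1: scene-agnostic pre-training
            scene_feat = torch.zeros(noisy_motion.shape[0], self.scene_dim)
        x = torch.cat([noisy_motion, t[:, None].float(), text_feat, scene_feat], dim=-1)
        return self.net(x)       # predicted noise (or clean motion)

model = MotionDenoiser()
# Stage 1: train with scene_feat=None on large text-to-motion data.
eps1 = model(torch.randn(4, 263), torch.randint(0, 1000, (4,)), torch.randn(4, 512))
# Stage 2: fine-tune with features of the scenes the motions are embedded in.
eps2 = model(torch.randn(4, 263), torch.randint(0, 1000, (4,)),
             torch.randn(4, 512), torch.randn(4, 128))
```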
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction [27.10256777126629]
This paper showcases the potential of generating human-object interactions without direct training on text-interaction pair data.
We introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion.
By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner.
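The zero-shot composition described above can be pictured as a loop in which a text-driven human motion model proposes the next body state and a learned world model predicts how the object responds. The sketch below only illustrates that control flow; both models are random stand-ins.

```python
import torch

def human_policy(text_feat, body_state):
    # Placeholder for a text-driven human motion model proposing the next body state.
    return body_state + 0.01 * torch.randn_like(body_state)

def world_model(body_state, object_state):
    # Placeholder learned dynamics: the object drifts toward the "hand" (first 3 dims)
    # while its orientation is kept unchanged.
    new_xyz = object_state[:3] + 0.5 * (body_state[:3] - object_state[:3])
    return new_xyz, object_state[3:]

text_feat = torch.randn(512)
body, obj = torch.zeros(75), torch.zeros(7)   # body pose params / object pose (xyz + quat)
trajectory = []
for t in range(60):
    body = human_policy(text_feat, body)
    obj_xyz, obj_rot = world_model(body, obj)
    obj = torch.cat([obj_xyz, obj_rot])
    trajectory.append((body.clone(), obj.clone()))
```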
arXiv Detail & Related papers (2024-03-28T17:59:30Z)
- Revisit Human-Scene Interaction via Space Occupancy [55.67657438543008]
Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks.
In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective.
By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database.
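One illustrative way, our reading rather than necessarily the paper's exact procedure, to pair a motion-only sequence with a pseudo scene occupancy is to voxelize the workspace and mark voxels the body sweeps through as free, leaving the remainder as potentially occupied space:

```python
import numpy as np

def motion_to_occupancy(joint_positions, bounds=(-2.0, 2.0), res=0.1):
    """joint_positions: (T, J, 3) world-space joints over time (made-up shapes)."""
    n = round((bounds[1] - bounds[0]) / res)
    occ = np.ones((n, n, n), dtype=bool)              # start fully "occupied"
    idx = ((joint_positions - bounds[0]) / res).astype(int)
    idx = np.clip(idx, 0, n - 1).reshape(-1, 3)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = False      # body passed through -> free
    return occ

occ = motion_to_occupancy(np.random.uniform(-1, 1, size=(120, 22, 3)))
print(occ.shape, occ.mean())   # fraction of voxels still marked as occupied
```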
arXiv Detail & Related papers (2023-12-05T12:03:00Z)
- InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, which encourages the synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the distances between joint pairs for human interactions can be generated using an off-the-shelf Large Language Model.
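A minimal sketch of joint-pair distance control as summarized above: a spec (which an off-the-shelf LLM could emit, e.g. as JSON) names joint pairs and target distances, and a differentiable penalty on those distances can serve as classifier-style guidance during sampling. Joint indices and the guidance step are illustrative, not the paper's code.

```python
import torch

spec = [  # e.g. "hold hands ~5 cm apart", "keep hips ~30 cm apart" (made-up indices)
    {"joint_a": 21, "joint_b": 20, "target_m": 0.05},
    {"joint_a": 10, "joint_b": 11, "target_m": 0.30},
]

def pair_distance_loss(joints_a, joints_b, spec):
    """joints_*: (T, J, 3) joint positions of the two interacting people."""
    loss = 0.0
    for c in spec:
        d = (joints_a[:, c["joint_a"]] - joints_b[:, c["joint_b"]]).norm(dim=-1)
        loss = loss + ((d - c["target_m"]) ** 2).mean()
    return loss

# During sampling, the gradient of this loss w.r.t. the current motion estimate can
# act as classifier-style guidance (illustrative usage, not the paper's sampler):
joints_a = torch.randn(60, 22, 3, requires_grad=True)
joints_b = torch.randn(60, 22, 3)
grad, = torch.autograd.grad(pair_distance_loss(joints_a, joints_b, spec), joints_a)
guided = (joints_a - 0.1 * grad).detach()
```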
arXiv Detail & Related papers (2023-11-27T14:32:33Z)
- Synthesizing Physical Character-Scene Interactions [64.26035523518846]
For realistic character animation, it is necessary to synthesize such interactions between virtual characters and their surroundings.
We present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters.
Our approach takes physics-based character motion generation a step closer to broad applicability.
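A hedged sketch of the adversarial-imitation-plus-RL recipe above: the policy's reward mixes a task term with a style term from a discriminator trained to distinguish reference motion clips from the policy's clips. Networks, features, and weights are stand-ins.

```python
import torch
import torch.nn as nn

# Discriminator over (made-up) 64-d motion features of short clips.
disc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def style_reward(motion_feat):
    # Higher when the discriminator believes the clip looks like reference motion.
    with torch.no_grad():
        return -torch.log(1 - torch.sigmoid(disc(motion_feat)) + 1e-6).squeeze(-1)

def total_reward(task_r, motion_feat, w_task=0.5, w_style=0.5):
    # Task term: e.g. progress toward sitting on or carrying the target object.
    return w_task * task_r + w_style * style_reward(motion_feat)

r = total_reward(torch.tensor([0.8]), torch.randn(1, 64))
```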
arXiv Detail & Related papers (2023-02-02T05:21:32Z)
- Compositional Human-Scene Interaction Synthesis with Semantic Control [16.93177243590465]
We aim to synthesize humans interacting with a given 3D scene controlled by high-level semantic specifications.
We design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded.
Inspired by the compositional nature of interactions, in which humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs.
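To illustrate the composition of atomic action-object pairs described above, here is a hypothetical encoder (not the authors' architecture) that turns a variable number of (action, object) pairs into a single interaction code with a small transformer:

```python
import torch
import torch.nn as nn

n_actions, d = 16, 64
action_emb = nn.Embedding(n_actions, d)   # atomic actions, e.g. "sit on", "touch"
object_proj = nn.Linear(3, d)             # sampled object points -> feature
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

def encode_interaction(action_ids, object_points):
    """action_ids: (K,) atomic action labels; object_points: (K, P, 3), one point
    set per interacted object. Returns a single composed interaction code."""
    obj_feat = object_proj(object_points).mean(dim=1)   # (K, d) per-object token
    tokens = action_emb(action_ids) + obj_feat           # one token per action-object pair
    return encoder(tokens[None]).mean(dim=1)             # (1, d) composed code

code = encode_interaction(torch.tensor([3, 7]), torch.randn(2, 256, 3))
```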
arXiv Detail & Related papers (2022-07-26T11:37:44Z)
- iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [54.04456391489063]
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
We show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)