Unified Human-Scene Interaction via Prompted Chain-of-Contacts
- URL: http://arxiv.org/abs/2309.07918v5
- Date: Tue, 05 Nov 2024 02:17:22 GMT
- Title: Unified Human-Scene Interaction via Prompted Chain-of-Contacts
- Authors: Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang
- Abstract summary: Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality.
This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands.
- Score: 61.87652569413429
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and the development of a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as Chain of Contacts (CoC): steps of human joint-object part pairs, which is inspired by the strong correlation between interaction types and human-object contact regions. Based on the definition, UniHSI constitutes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .
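To make the Chain of Contacts (CoC) definition above concrete, the sketch below shows one plausible way a language-prompted task plan could be encoded as steps of human joint-object part pairs. The class names, field names, contact relations, and the `plan_interaction` stand-in are hypothetical illustrations under that reading of the abstract, not the actual UniHSI schema or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContactPair:
    joint: str        # humanoid joint, e.g. "pelvis" or "right_hand" (illustrative names)
    object_part: str  # scene object part, e.g. "chair_seat" (illustrative names)
    relation: str     # desired relation, e.g. "contact" or "not_contact" (assumed vocabulary)


@dataclass
class CoCStep:
    pairs: List[ContactPair]  # one step of the chain: joint-object part pairs


def plan_interaction(prompt: str) -> List[CoCStep]:
    """Stand-in for the LLM Planner: map a language command to CoC steps."""
    # A real planner would query an LLM with the prompt and scene description;
    # one example plan is hard-coded here purely for illustration.
    if "sit" in prompt.lower():
        return [
            CoCStep([ContactPair("pelvis", "chair_seat", "contact"),
                     ContactPair("torso", "chair_back", "contact")]),
        ]
    return []


# In the framework described by the abstract, a Unified Controller would then
# consume each CoCStep as a uniform task specification and execute it with a
# physics-based control policy.
for step in plan_interaction("Sit on the chair"):
    print([(p.joint, p.object_part, p.relation) for p in step.pairs])
```

This is only a minimal sketch of the prompted plan-to-execution interface suggested by the abstract; the actual representation used in UniHSI may differ.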
Related papers
- TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization [41.224062790263375]
TokenHSI is a transformer-based policy capable of multi-skill unification and flexible adaptation.
Key insight is to model the humanoid proprioception as a separate shared token.
Our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios.
arXiv Detail & Related papers (2025-03-25T17:57:46Z)
- Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics [30.43930233035367]
This paper introduces a unified Human-Object Interaction framework.
It provides unified control over interactions with static scenes and dynamic objects using language commands.
Our framework supports long-horizon interactions among dynamic, articulated, and static objects.
arXiv Detail & Related papers (2025-03-24T05:18:04Z)
- RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios [60.772871735598706]
RefHCM (Referring Human-Centric Model) is a framework to integrate a wide range of human-centric referring tasks.
RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens.
This work represents the first attempt to address referring human perceptions with a general-purpose framework.
arXiv Detail & Related papers (2024-12-19T08:51:57Z)
- SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation [38.96874874208242]
We introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy.
Specifically, we employ Large Language Models with Retrieval-Augmented Generation to generate coherent and diverse long-form scripts.
A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues.
arXiv Detail & Related papers (2024-11-29T18:36:15Z)
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- Revisit Human-Scene Interaction via Space Occupancy [55.67657438543008]
Human-Scene Interaction (HSI) generation is challenging and crucial for various downstream tasks.
In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective.
By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database.
arXiv Detail & Related papers (2023-12-05T12:03:00Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making [19.87916700767421]
Vision language decision making (VLDM) is a challenging multimodal task.
From an environment perspective, we find that task episodes can be divided into fine-grained units.
We propose a novel hybrid-training framework that enables active exploration in the environment and reduces the exposure bias.
arXiv Detail & Related papers (2023-07-16T11:54:16Z)
- RoCo: Dialectic Multi-Robot Collaboration with Large Language Models [13.260289557301688]
We propose a novel approach to multi-robot collaboration that harnesses the power of pre-trained large language models (LLMs).
We show that RoCo easily incorporates a human in the loop, allowing a user to communicate and collaborate with a robot agent to complete tasks together.
arXiv Detail & Related papers (2023-07-10T17:52:01Z)
- A Unified Architecture for Dynamic Role Allocation and Collaborative Task Planning in Mixed Human-Robot Teams [0.0]
We present a novel architecture for dynamic role allocation and collaborative task planning in a mixed human-robot team of arbitrary size.
The architecture capitalizes on a centralized, reactive, and modular task-agnostic planning method based on Behavior Trees (BTs).
Different metrics used as the mixed-integer linear programming (MILP) cost allow the architecture to favor different aspects of the collaboration.
arXiv Detail & Related papers (2023-01-19T12:30:56Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)