Open-Vocabulary Functional 3D Human-Scene Interaction Generation
- URL: http://arxiv.org/abs/2601.20835v2
- Date: Fri, 30 Jan 2026 16:39:55 GMT
- Title: Open-Vocabulary Functional 3D Human-Scene Interaction Generation
- Authors: Jie Liu, Yu Sun, Alpar Cseke, Yao Feng, Nicolas Heron, Michael J. Black, Yan Zhang
- Abstract summary: FunHSI is a training-free framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. We show that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
- Score: 45.61489012931424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
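The abstract says that high-level interactions are modeled "via a contact graph" but gives no detail on its form. As a rough illustration only, the sketch below assumes the graph is a set of (body part, scene element) pairs attached to a task prompt. The class names, method names, body-part labels, and the choice of a thermostat as the functional element are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Contact:
    """One edge of the graph: a body part touching a scene element (hypothetical)."""
    body_part: str
    scene_element: str


@dataclass
class ContactGraph:
    """Hypothetical container for the contacts implied by a task prompt."""
    task: str
    contacts: list[Contact] = field(default_factory=list)

    def add(self, body_part: str, scene_element: str) -> None:
        """Record that `body_part` should be in contact with `scene_element`."""
        self.contacts.append(Contact(body_part, scene_element))

    def scene_elements(self) -> set[str]:
        """Return every scene element that appears in at least one contact."""
        return {c.scene_element for c in self.contacts}

    def parts_touching(self, scene_element: str) -> list[str]:
        """Return the sorted body parts in contact with one scene element."""
        return sorted(
            c.body_part for c in self.contacts if c.scene_element == scene_element
        )


# Assumed example: the prompt comes from the abstract, the contacts are invented.
graph = ContactGraph(task="increasing the room temperature")
graph.add("right_hand", "thermostat")  # assumed functional element
graph.add("left_foot", "floor")        # assumed supporting contact
graph.add("right_foot", "floor")       # assumed supporting contact
```

In a structure like this, a later optimization stage could iterate over `graph.contacts` and penalize the distance between each body part and its target element; that use is likewise an assumption, not a description of FunHSI's actual refinement.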
Related papers
- Language-guided 3D scene synthesis for fine-grained functionality understanding [64.148891566272]
We introduce SynthFun3D, the first method for task-based 3D scene synthesis.
It generates a 3D indoor environment using a furniture asset database with part-level annotation.
It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element.
arXiv Detail & Related papers (2025-11-28T14:40:03Z) - Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces [113.91791599146786]
We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images.
Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships.
We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs.
arXiv Detail & Related papers (2025-03-24T22:53:19Z) - FunHOI: Annotation-Free 3D Hand-Object Interaction Generation via Functional Text Guidance [9.630837159704004]
Hand-object interaction (HOI) is the fundamental link between human and environment.
Despite advances in AI and robotics, capturing the semantics of functional grasping tasks remains a considerable challenge.
We propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text.
arXiv Detail & Related papers (2025-02-28T07:42:54Z) - Functional 3D Scene Synthesis through Human-Scene Optimization [30.910671968876024]
Our approach is based on a simple, but effective principle: we condition scene synthesis to generate rooms that are usable by humans.
If this human-centric scene generation is viable, the room layout is functional and it leads to a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-05T04:00:24Z) - Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z) - GenZI: Zero-Shot 3D Human-Scene Interaction Generation [39.9039943099911]
We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions.
Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions.
In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data.
arXiv Detail & Related papers (2023-11-29T15:40:11Z) - Synthesizing Diverse Human Motions in 3D Indoor Scenes [16.948649870341782]
We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner.
Existing approaches rely on training sequences that contain captured human motions and the 3D scenes they interact with.
We propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously.
arXiv Detail & Related papers (2023-05-21T09:22:24Z) - HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes [54.61610144668777]
We present a novel scene-and-language conditioned generative model that can produce 3D human motions in 3D scenes.
Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes.
arXiv Detail & Related papers (2022-10-18T10:14:11Z) - Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors [42.17542596399014]
We present a method for inferring diverse 3D models of human-object interactions from images.
Our method extracts high-level commonsense knowledge from large language models.
We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset.
arXiv Detail & Related papers (2022-09-06T13:32:55Z) - Compositional Human-Scene Interaction Synthesis with Semantic Control [16.93177243590465]
We aim to synthesize humans interacting with a given 3D scene controlled by high-level semantic specifications.
We design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded.
Inspired by the compositional nature of interactions that humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs.
arXiv Detail & Related papers (2022-07-26T11:37:44Z) - Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction [158.74130075865835]
Given a malfunctional 3D object, humans can perform mental simulations to reason about its functionality and figure out how to fix it.
To mimic humans' mental simulation process, we present FixNet, a novel framework that seamlessly incorporates perception and physical dynamics.
arXiv Detail & Related papers (2022-05-05T17:59:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.