LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control
- URL: http://arxiv.org/abs/2406.16038v1
- Date: Sun, 23 Jun 2024 07:26:13 GMT
- Title: LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control
- Authors: Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Bin Zhao, Dong Wang, Xuelong Li
- Abstract summary: We extend interactive object reconstruction from the single-object level to the complex scene level.
We propose LiveScene, the first scene-level language-embedded interactive neural radiance field.
LiveScene efficiently reconstructs and controls multiple interactive objects in complex scenes.
- Score: 45.1230495980299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to advance physical-world interactive scene reconstruction by extending interactive object reconstruction from the single-object level to the complex scene level. To this end, we first construct one simulated and one real scene-level physical interaction dataset, together containing 28 scenes with multiple interactive objects per scene. Furthermore, to accurately model the interactive motions of multiple objects in complex scenes, we propose LiveScene, the first scene-level language-embedded interactive neural radiance field, which efficiently reconstructs and controls multiple interactive objects in complex scenes. LiveScene introduces an efficient factorization that decomposes the interactive scene into multiple local deformable fields to reconstruct individual interactive objects separately, achieving the first accurate and independent control of multiple interactive objects in a complex scene. Moreover, we introduce an interaction-aware language embedding method that generates varying language embeddings to localize individual interactive objects under different interactive states, enabling arbitrary control of interactive objects using natural language. Finally, we evaluate LiveScene on the constructed datasets OminiSim and InterReal with various simulated and real-world complex scenes. Extensive experimental results demonstrate that the proposed approach achieves state-of-the-art novel view synthesis and language grounding performance, surpassing existing methods by +9.89, +1.30, and +1.99 PSNR on the CoNeRF Synthetic, OminiSim #challenging, and InterReal #challenging datasets, respectively, and by +65.12 mIOU on OminiSim. Project page: https://livescenes.github.io.
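The abstract names two concrete mechanisms: a factorization of the scene into per-object local deformable fields, and language embeddings conditioned on the interaction state. The following is a minimal conceptual sketch of both ideas in PyTorch; every class name, dimension, and routing choice here is a hypothetical illustration, not LiveScene's actual implementation.

```python
import torch
import torch.nn as nn

# Conceptual sketch only; names and shapes are hypothetical, not LiveScene's code.

class LocalDeformableField(nn.Module):
    """One small deformation MLP per interactive object.

    Maps a 3D sample point plus a scalar interaction state (e.g. how far
    a drawer is pulled out) to a displacement; the displaced point would
    then be used to query a shared static radiance field.
    """

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # xyz displacement
        )

    def forward(self, xyz: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3); state: (1, 1) interaction value, here in [0, 1]
        return self.mlp(torch.cat([xyz, state.expand(xyz.shape[0], 1)], dim=-1))


class InteractionAwareEmbedding(nn.Module):
    """Modulates a fixed per-object text feature by the interaction state,
    so a query like "the open drawer" matches only when the state agrees."""

    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.modulator = nn.Linear(1, text_dim)

    def forward(self, text_emb: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return text_emb * torch.sigmoid(self.modulator(state))


# Each interactive object owns one local field (e.g. routed by bounding box);
# points outside every box would be rendered by the static field unchanged.
fields = nn.ModuleList([LocalDeformableField() for _ in range(3)])
xyz = torch.rand(1024, 3)                 # ray samples inside object 0's region
state = torch.tensor([[0.7]])             # object 0 is 70% "open"
deformed = xyz + fields[0](xyz, state)    # query the static radiance field here

text_emb = torch.randn(1, 512)            # stand-in for a CLIP-style text feature
grounded = InteractionAwareEmbedding()(text_emb, state)
```

The appeal of such a factorization, as the abstract describes it, is that each object's deformation field can be driven independently, which is what makes separate control of multiple interactive objects in one scene possible.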
Related papers
- HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - Natural-language-driven Simulation Benchmark and Copilot for Efficient Production of Object Interactions in Virtual Road Scenes [8.303084278117861]
We advocate natural-language-driven (NLD) simulation to efficiently produce interactions between multiple objects in virtual road scenes.
We collect the Language-to-Interaction(L2I) benchmark dataset with 120,000 natural-language descriptions of object interactions in 6 common types of road topologies.
As a methodology contribution, we design SimCopilot to translate the interaction descriptions to the renderable code.
arXiv Detail & Related papers (2023-12-07T02:55:46Z) - ASSIST: Interactive Scene Nodes for Scalable and Realistic Indoor Simulation [17.34617771579733]
We present ASSIST, an object-wise neural radiance field as a panoptic representation for compositional and realistic simulation.
A novel scene node data structure stores the information of each object in a unified fashion, allowing online interaction in both intra- and cross-scene settings (see the sketch after this list).
arXiv Detail & Related papers (2023-11-10T17:56:43Z) - Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter [14.489086924126253]
This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes.
Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and they are evaluated on private datasets or simulators that do not capture the complexity of natural indoor scenes.
We propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs.
arXiv Detail & Related papers (2023-11-09T22:55:10Z) - Unified Human-Scene Interaction via Prompted Chain-of-Contacts [61.87652569413429]
Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality.
This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands.
arXiv Detail & Related papers (2023-09-14T17:59:49Z) - Synthesizing Physical Character-Scene Interactions [64.26035523518846]
Realistic character animation requires synthesizing interactions between virtual characters and their surroundings.
We present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters.
Our approach takes physics-based character motion generation a step closer to broad applicability.
arXiv Detail & Related papers (2023-02-02T05:21:32Z) - Compositional Human-Scene Interaction Synthesis with Semantic Control [16.93177243590465]
We aim to synthesize humans interacting with a given 3D scene controlled by high-level semantic specifications.
We design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded.
Inspired by the compositional nature of interactions, whereby humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs.
arXiv Detail & Related papers (2022-07-26T11:37:44Z) - iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [54.04456391489063]
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
iGibson's features enable the generalization of navigation agents, and the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)
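The scene-node idea from the ASSIST entry above lends itself to a compact illustration. The following is a hypothetical Python sketch, not ASSIST's actual data structure or API; it only shows the stated design of storing each object's information in one unified node so it can be edited within a scene or reused across scenes.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an object-wise "scene node"; not ASSIST's API.

@dataclass
class SceneNode:
    """One object's rendering and interaction info, stored in a unified fashion."""
    name: str
    pose: list                                            # 4x4 world-from-object transform
    radiance_params: dict = field(default_factory=dict)   # per-object field weights/ids
    label: str = ""                                       # panoptic/semantic label


@dataclass
class Scene:
    nodes: dict = field(default_factory=dict)

    def add(self, node: SceneNode) -> None:
        self.nodes[node.name] = node              # intra-scene: edit nodes in place

    def transplant(self, name: str, other: "Scene") -> None:
        other.add(self.nodes.pop(name))           # cross-scene: move a node over


kitchen, bedroom = Scene(), Scene()
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
kitchen.add(SceneNode("mug", pose=identity, label="mug"))
kitchen.transplant("mug", bedroom)                # reuse the object in a new scene
```

Keeping everything an object needs inside a single node is what makes both kinds of interaction cheap: intra-scene edits touch one node, and cross-scene reuse is just moving a node between containers.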