Holodeck: Language Guided Generation of 3D Embodied AI Environments
- URL: http://arxiv.org/abs/2312.09067v2
- Date: Mon, 22 Apr 2024 20:06:03 GMT
- Title: Holodeck: Language Guided Generation of 3D Embodied AI Environments
- Authors: Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark
- Abstract summary: Holodeck is a system that generates 3D environments to match a user-supplied prompt fully automatedly.
We show that annotators prefer Holodeck over manually designed procedural baselines in residential scenes.
We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes without human-constructed data.
- Score: 84.16126434848829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatedly. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs for styles, and can capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (i.e., GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes like music rooms and daycares without human-constructed data, which is a significant step forward in developing general-purpose embodied agents.
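The abstract's two-step idea (obtain spatial relational constraints between objects, then search for a layout that satisfies them) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the room size, the object footprints, the constraint names `near` and `left_of`, and the exhaustive grid search are hypothetical and are not taken from Holodeck's implementation. The constraints are hard-coded where the real system would obtain them from GPT-4.

```python
from itertools import product

# Hypothetical room size in grid cells (width, depth).
ROOM_W, ROOM_D = 4, 3

# Hypothetical objects: name -> (width, depth) footprint in cells.
OBJECTS = {"desk": (2, 1), "chair": (1, 1), "shelf": (1, 2)}

# Hypothetical relational constraints of the kind a language model might emit.
CONSTRAINTS = [
    ("chair", "near", "desk"),
    ("shelf", "left_of", "desk"),
]


def rect(name, pos):
    """Axis-aligned footprint (x0, y0, x1, y1) of an object placed at pos."""
    w, d = OBJECTS[name]
    x, y = pos
    return (x, y, x + w, y + d)


def overlaps(a, b):
    """True if two footprints share interior area (touching edges is fine)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def center(r):
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def satisfied(kind, a, b):
    """Check one relational constraint between footprints a and b."""
    (ax, ay), (bx, by) = center(a), center(b)
    if kind == "near":
        # Manhattan distance between centers, with an arbitrary threshold.
        return abs(ax - bx) + abs(ay - by) <= 1.5
    if kind == "left_of":
        return a[2] <= b[0]
    raise ValueError(f"unknown constraint: {kind}")


def score(layout):
    """Number of constraints the layout satisfies."""
    rects = {n: rect(n, p) for n, p in layout.items()}
    return sum(satisfied(k, rects[a], rects[b]) for a, k, b in CONSTRAINTS)


def candidates(name):
    """All grid positions that keep the object fully inside the room."""
    w, d = OBJECTS[name]
    return [(x, y) for x in range(ROOM_W - w + 1) for y in range(ROOM_D - d + 1)]


def solve():
    """Exhaustively search non-overlapping layouts; keep the best-scoring one."""
    names = list(OBJECTS)
    best, best_score = None, -1
    for combo in product(*(candidates(n) for n in names)):
        layout = dict(zip(names, combo))
        rects = [rect(n, p) for n, p in layout.items()]
        if any(overlaps(rects[i], rects[j])
               for i in range(len(rects)) for j in range(i + 1, len(rects))):
            continue
        s = score(layout)
        if s > best_score:
            best, best_score = layout, s
    return best, best_score


if __name__ == "__main__":
    layout, satisfied_count = solve()
    print(layout, f"{satisfied_count}/{len(CONSTRAINTS)} constraints satisfied")
```

Exhaustive search is only workable for a handful of objects on a coarse grid; it is used here to keep the sketch short, and it scores layouts by the number of satisfied constraints so that a partially satisfiable constraint set still yields a layout.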
Related papers
- SceneTeller: Language-to-3D Scene Generation [15.209079637302905]
Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it.
Our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices.
arXiv Detail & Related papers (2024-07-30T10:45:28Z)
- BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation [96.58789785954409]
We propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view map.
We produce large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and then stitching them with smooth consistency.
arXiv Detail & Related papers (2023-12-04T18:56:10Z)
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [67.49461023261536]
We explore a new framework for learning a world model, OccWorld, in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
arXiv Detail & Related papers (2023-11-27T17:59:41Z)
- UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields [22.180286908121946]
We propose UrbanGIRAFFE, which uses a coarse 3D panoptic prior to guide a 3D-aware generative model.
Our model is compositional and controllable as it breaks down the scene into stuff, objects, and sky.
With proper loss functions, our approach facilitates photorealistic 3D-aware image synthesis with diverse controllability.
arXiv Detail & Related papers (2023-03-24T17:28:07Z)
- HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR [51.9200422793806]
Using only body-mounted IMUs and LiDAR, HSC4D is space-free without any external devices' constraints and map-free without pre-built maps.
Relationships between humans and environments are also explored to make their interaction more realistic.
arXiv Detail & Related papers (2022-03-17T10:05:55Z)
- Human-Aware Object Placement for Visual Environment Reconstruction [63.14733166375534]
We show that human-scene interactions can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video.
Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images.
We show that our scene reconstruction can be used to refine the initial 3D human pose and shape estimation.
arXiv Detail & Related papers (2022-03-07T18:59:02Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.