Holodeck: Language Guided Generation of 3D Embodied AI Environments
- URL: http://arxiv.org/abs/2312.09067v2
- Date: Mon, 22 Apr 2024 20:06:03 GMT
- Title: Holodeck: Language Guided Generation of 3D Embodied AI Environments
- Authors: Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark
- Abstract summary: Holodeck is a system that automatically generates 3D environments to match a user-supplied prompt.
We show that annotators prefer Holodeck over manually designed procedural baselines in residential scenes.
We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes without human-constructed data.
- Score: 84.16126434848829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatically. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs for styles, and can capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (i.e., GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes like music rooms and daycares without human-constructed data, which is a significant step forward in developing general-purpose embodied agents.
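The abstract's idea of generating spatial relational constraints and then optimizing a layout to satisfy them can be illustrated with a toy sketch. This is not Holodeck's implementation: the grid room, the constraint names ("wall", "near"), the object list, and the exhaustive search are all assumptions chosen for brevity, and the constraints are hard-coded here rather than produced by a language model.

```python
from itertools import product

# Hypothetical room: a 4 x 3 grid of unit cells (not taken from the paper).
ROOM_W, ROOM_D = 4, 3

# Hypothetical objects and relational constraints; in the described system
# constraints would come from an LLM, here they are written by hand.
OBJECTS = ["desk", "chair", "bookshelf"]
CONSTRAINTS = [
    ("wall", "desk"),           # desk should touch a wall
    ("near", "chair", "desk"),  # chair should be adjacent to the desk
    ("wall", "bookshelf"),      # bookshelf should touch a wall
]

def on_wall(pos):
    """True if the cell lies on the boundary of the room grid."""
    x, y = pos
    return x in (0, ROOM_W - 1) or y in (0, ROOM_D - 1)

def near(a, b):
    """True if two cells share an edge (Manhattan distance 1)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

def score(layout, constraints):
    """Count how many constraints the layout satisfies."""
    total = 0
    for kind, *args in constraints:
        if kind == "wall":
            total += int(on_wall(layout[args[0]]))
        elif kind == "near":
            total += int(near(layout[args[0]], layout[args[1]]))
    return total

def solve(objects, constraints):
    """Enumerate collision-free placements and keep the best-scoring one."""
    cells = list(product(range(ROOM_W), range(ROOM_D)))
    best, best_score = None, -1

    def place(i, layout):
        nonlocal best, best_score
        if i == len(objects):
            s = score(layout, constraints)
            if s > best_score:
                best, best_score = dict(layout), s
            return
        for cell in cells:
            if cell in layout.values():
                continue  # one object per cell, so no collisions
            layout[objects[i]] = cell
            place(i + 1, layout)
            del layout[objects[i]]

    place(0, {})
    return best, best_score

layout, satisfied = solve(OBJECTS, CONSTRAINTS)
print(layout, satisfied)
```

Exhaustive enumeration only works for a handful of objects on a small grid; a realistic system would need a smarter search and continuous positions, rotations, and object sizes, none of which this sketch models.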
Related papers
- EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents [85.77432303199176]
We propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones.
Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes.
Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via
arXiv Detail & Related papers (2026-02-26T16:53:41Z)
- SceneFoundry: Generating Interactive Infinite 3D Worlds [22.60801815197924]
SceneFoundry is a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture.
Our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions.
arXiv Detail & Related papers (2026-01-09T14:33:10Z)
- SPATIALGEN: Layout-guided 3D Indoor Scene Generation [37.30623176278608]
We present SpatialGen, a novel multi-view multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes.
Given a 3D layout and a reference image, our model synthesizes appearance (color image), geometry (scene coordinate map), and semantic (semantic segmentation map) from arbitrary viewpoints.
We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.
arXiv Detail & Related papers (2025-09-18T14:12:32Z)
- HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation [31.010614667725843]
Hierarchical Layout Generation (HLG) is a novel method for fine-grained 3D scene generation.
HLG is the first to adopt a coarse-to-fine hierarchical approach, refining scene layouts from large-scale furniture placement to intricate object arrangements.
We show superior performance in generating realistic indoor scenes compared to existing methods.
arXiv Detail & Related papers (2025-08-25T09:32:57Z)
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes [30.015378490907988]
Anywhere3D-Bench is a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs.
We assess a range of state-of-the-art 3D visual grounding methods alongside large language models.
arXiv Detail & Related papers (2025-06-05T11:28:02Z)
- HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception.
Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention.
This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z)
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation [36.44409268300039]
Scenethesis is a framework that integrates text-based scene planning with vision-guided layout refinement.
It generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
arXiv Detail & Related papers (2025-05-05T17:59:58Z)
- SceneTeller: Language-to-3D Scene Generation [15.209079637302905]
Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it.
Our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices.
arXiv Detail & Related papers (2024-07-30T10:45:28Z)
- BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation [96.58789785954409]
We propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view map.
We produce large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and then stitching them with smooth consistency.
arXiv Detail & Related papers (2023-12-04T18:56:10Z)
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [67.49461023261536]
We explore a new framework for learning a world model, OccWorld, in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
arXiv Detail & Related papers (2023-11-27T17:59:41Z)
- UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields [22.180286908121946]
We propose UrbanGIRAFFE, which uses a coarse 3D panoptic prior to guide a 3D-aware generative model.
Our model is compositional and controllable as it breaks down the scene into stuff, objects, and sky.
With proper loss functions, our approach facilitates photorealistic 3D-aware image synthesis with diverse controllability.
arXiv Detail & Related papers (2023-03-24T17:28:07Z)
- HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR [51.9200422793806]
Using only body-mounted IMUs and LiDAR, HSC4D is space-free without any external devices' constraints and map-free without pre-built maps.
Relationships between humans and environments are also explored to make their interaction more realistic.
arXiv Detail & Related papers (2022-03-17T10:05:55Z)
- Human-Aware Object Placement for Visual Environment Reconstruction [63.14733166375534]
We show that human-scene interactions can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video.
Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images.
We show that our scene reconstruction can be used to refine the initial 3D human pose and shape estimation.
arXiv Detail & Related papers (2022-03-07T18:59:02Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.