SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
- URL: http://arxiv.org/abs/2602.09153v1
- Date: Mon, 09 Feb 2026 19:56:04 GMT
- Title: SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
- Authors: Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake,
- Abstract summary: SceneSmith builds environments from architectural layout to natural furniture population. It generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. SceneSmith environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
- Score: 19.995619927680476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
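The abstract describes each generation stage as an interaction among three VLM agents: a designer, a critic, and an orchestrator. The following is a minimal Python sketch of such a staged designer-critic loop; all names, stubs, and stopping rules are hypothetical stand-ins for VLM calls, not the authors' implementation:

```python
# Hypothetical designer-critic-orchestrator loop; the VLM calls are stubbed out.
from dataclasses import dataclass, field

@dataclass
class Scene:
    objects: list = field(default_factory=list)

def designer(scene: Scene, stage: str) -> Scene:
    # Stand-in for a VLM proposing an addition for the current stage.
    scene.objects.append(f"{stage}-proposal-{len(scene.objects)}")
    return scene

def critic(scene: Scene) -> bool:
    # Stand-in for a VLM judging whether the stage is complete
    # (here, trivially: accept every third proposal).
    return bool(scene.objects) and len(scene.objects) % 3 == 0

def orchestrator(stages: list, max_rounds: int = 10) -> Scene:
    # Run the designer-critic exchange for each successive stage:
    # layout first, then furniture, then small-object population.
    scene = Scene()
    for stage in stages:
        for _ in range(max_rounds):
            scene = designer(scene, stage)
            if critic(scene):
                break  # critic approved; move on to the next stage
    return scene

scene = orchestrator(["layout", "furniture", "small-objects"])
```

In the real system each stage would also invoke asset generation (text-to-3D synthesis, dataset retrieval, physical property estimation) before the critic's review.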
Related papers
- MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation [56.30931340537373]
MolmoSpaces is a fully open ecosystem to support benchmarking of robot policies.
MolmoSpaces consists of over 230k diverse indoor environments.
MolmoSpaces-Bench is a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects.
arXiv Detail & Related papers (2026-02-11T20:16:31Z)
- SAGE: Scalable Agentic 3D Scene Generation for Embodied AI [67.43935343696982]
Existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes.
We present SAGE, an agentic framework that, given a user-specified embodied task, understands the intent and automatically generates simulation-ready environments at scale.
The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training.
arXiv Detail & Related papers (2026-02-10T18:59:55Z)
- Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions [27.247431258140463]
We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos.
We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing.
arXiv Detail & Related papers (2025-11-06T18:52:08Z)
- URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM).
URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction.
Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
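URDF is the standard articulation format such a reconstruction pipeline would emit. As a reference point, here is a minimal hand-written URDF for a cabinet with one revolute door hinge, parsed with Python's standard XML module; this is an illustrative example, not output produced by URDF-Anything:

```python
# Minimal illustrative URDF: a cabinet body and a door joined by a revolute hinge.
import xml.etree.ElementTree as ET

URDF = """<robot name="cabinet">
  <link name="body"/>
  <link name="door"/>
  <joint name="hinge" type="revolute">
    <parent link="body"/>
    <child link="door"/>
    <origin xyz="0.3 0 0" rpy="0 0 0"/>
    <axis xyz="0 0 1"/>
    <limit lower="0.0" upper="1.57" effort="10.0" velocity="1.0"/>
  </joint>
</robot>"""

root = ET.fromstring(URDF)         # well-formed XML parses without error
joint = root.find("joint")
print(joint.get("type"))           # prints: revolute
```

The kinematic parameters a method like URDF-Anything must predict correspond to the `origin`, `axis`, and `limit` fields of each joint.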
arXiv Detail & Related papers (2025-11-02T13:45:51Z)
- RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots [25.650235551519952]
We present RoboCasa, a large-scale simulation framework for training generalist robots in everyday environments.
We provide thousands of 3D assets across over 150 object categories and dozens of interactable furniture and appliances.
Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning.
arXiv Detail & Related papers (2024-06-04T17:41:31Z)
- RPMArt: Towards Robust Perception and Manipulation for Articulated Objects [56.73978941406907]
We propose a framework towards Robust Perception and Manipulation for Articulated Objects (RPMArt).
RPMArt learns to estimate the articulation parameters and manipulate the articulation part from the noisy point cloud.
We introduce an articulation-aware classification scheme to enhance its ability for sim-to-real transfer.
arXiv Detail & Related papers (2024-03-24T05:55:39Z)
- Ditto in the House: Building Articulation Models of Indoor Scenes through Interactive Perception [31.009703947432026]
This work explores building articulation models of indoor scenes through a robot's purposeful interactions.
We introduce an interactive perception approach to this task.
We demonstrate the effectiveness of our approach in both simulation and real-world scenes.
arXiv Detail & Related papers (2023-02-02T18:22:00Z)
- Phone2Proc: Bringing Robust Robots Into Our Chaotic World [50.51598304564075]
Phone2Proc is a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes.
The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan.
Phone2Proc improves sim-to-real ObjectNav success rates substantially, from 34.7% to 70.7%.
arXiv Detail & Related papers (2022-12-08T18:52:27Z)
- Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation [88.04759848307687]
In Meta-Sim2, we aim to learn the scene structure in addition to parameters, which is a challenging problem due to its discrete nature.
We use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training.
We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods.
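The Meta-Sim2 summary mentions a feature-space divergence between synthesized and target images. One common choice for such a divergence is the squared maximum mean discrepancy (MMD); the NumPy sketch below uses a biased MMD estimator with an RBF kernel as an illustrative assumption, not the paper's exact objective:

```python
# Illustrative feature-space divergence: biased squared MMD with an RBF kernel.
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # Pairwise kernel matrix k(x_i, y_j) = exp(-gamma * ||x_i - y_j||^2).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=0.5):
    # Biased estimator of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(64, 2))  # "target" image features
b = rng.normal(0.0, 1.0, size=(64, 2))  # same distribution as a
c = rng.normal(3.0, 1.0, size=(64, 2))  # shifted distribution
assert mmd2(a, c) > mmd2(a, b)  # the shifted set is farther in feature space
```

Minimizing such a divergence with respect to the scene-generation parameters would push the synthesized feature distribution toward the target one.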
arXiv Detail & Related papers (2020-08-20T17:28:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.