Steerable Scene Generation with Post Training and Inference-Time Search
- URL: http://arxiv.org/abs/2505.04831v1
- Date: Wed, 07 May 2025 22:07:42 GMT
- Title: Steerable Scene Generation with Post Training and Inference-Time Search
- Authors: Nicholas Pfaff, Hongkai Dai, Sergey Zakharov, Shun Iwase, Russ Tedrake
- Abstract summary: Training robots in simulation requires diverse 3D scenes that reflect specific challenges of downstream tasks. We generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation. We release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
- Score: 24.93360616245269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/
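To make the MCTS-based inference-time search concrete, the sketch below shows one way a tree-style search could steer a diffusion sampler toward a downstream objective. It is a simplified, greedy variant written against placeholder callables (denoise_step, project_feasible, score_scene) that stand in for the paper's denoiser, physics-based feasibility projection, and task reward; none of the names, shapes, or dynamics come from the released code or model.
```python
import numpy as np

# --- Placeholders (assumptions, not the paper's actual API) ------------------
def denoise_step(x, t, rng):
    """One stochastic reverse-diffusion step on a scene tensor x at noise level t."""
    return x + 0.1 * rng.standard_normal(x.shape)            # stand-in dynamics

def project_feasible(x):
    """Project a decoded scene onto physically feasible poses (stand-in)."""
    return np.clip(x, -1.0, 1.0)

def score_scene(x):
    """Task-specific reward, e.g. clutter level or object count (stand-in)."""
    return -float(np.linalg.norm(x))

def search_diffusion_sample(x_T, n_steps=10, branch=4, rollouts=2, seed=0):
    """Greedy tree search over stochastic denoising continuations: at every step,
    expand several children, estimate each child's value with cheap rollouts to
    the clean scene, and continue from the best child."""
    rng = np.random.default_rng(seed)
    x = x_T
    for t in range(n_steps, 0, -1):
        best_child, best_value = None, -np.inf
        for _ in range(branch):
            child = denoise_step(x, t, rng)
            values = []
            for _ in range(rollouts):                         # value estimate via rollouts
                y = child
                for s in range(t - 1, 0, -1):
                    y = denoise_step(y, s, rng)
                values.append(score_scene(project_feasible(y)))
            if np.mean(values) > best_value:
                best_child, best_value = child, float(np.mean(values))
        x = best_child
    return project_feasible(x)

scene = search_diffusion_sample(np.zeros((8, 7)))   # e.g. 8 objects x (class + SE(3) pose params)
print(score_scene(scene))
```
A full MCTS would additionally maintain a search tree with visit counts, UCB-style selection, and value backups; the greedy loop above keeps only the core idea of branching on denoising noise and scoring rollouts with the downstream objective.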
Related papers
- SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability [3.130722489512822]
SceneAware is a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. We combine a Transformer-based trajectory encoder with a ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. Our analysis shows that the model performs consistently well across various types of pedestrian movement.
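As a rough illustration of the two-encoder design described above, the following toy PyTorch sketch fuses a Transformer trajectory encoder with a ViT-style patch encoder over a walkability map; all dimensions, names, and the prediction head are assumptions, not the SceneAware implementation.
```python
import torch
import torch.nn as nn

class TrajectorySceneFusion(nn.Module):
    """Toy fusion of a Transformer trajectory encoder with a ViT-style scene
    encoder, loosely following the description above (not the SceneAware code)."""

    def __init__(self, d_model=128, patch=16, img_size=224):
        super().__init__()
        # Trajectory branch: encode a short sequence of (x, y) pedestrian positions.
        self.traj_embed = nn.Linear(2, d_model)
        self.traj_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Scene branch: minimal ViT-style patch embedding over a walkability map.
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        self.scene_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(2 * d_model, 2)       # predict the next (x, y) step

    def forward(self, traj, walkability_map):
        h_traj = self.traj_encoder(self.traj_embed(traj)).mean(dim=1)
        patches = self.patch_embed(walkability_map).flatten(2).transpose(1, 2)
        h_scene = self.scene_encoder(patches + self.pos).mean(dim=1)
        return self.head(torch.cat([h_traj, h_scene], dim=-1))

model = TrajectorySceneFusion()
next_step = model(torch.randn(4, 8, 2), torch.randn(4, 1, 224, 224))  # shape (4, 2)
```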
arXiv Detail & Related papers (2025-06-17T03:11:31Z) - Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving [27.088907562842902]
In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. The complexity of collecting and annotating 3D data is a bottleneck in these developments. We propose a novel approach able to generate scene-scale 3D semantic data without relying on any projection or decoupled trained multi-resolution models.
arXiv Detail & Related papers (2025-03-27T12:41:42Z) - Purposer: Putting Human Motion Generation in Context [30.706219830149504]
We present a novel method to generate human motion to populate 3D indoor scenes.
It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds.
arXiv Detail & Related papers (2024-04-19T15:16:04Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
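The two-branch layout/shape split can be pictured with the toy sketch below: a VAE-style branch decodes per-object layout boxes from scene-graph node embeddings, while a diffusion-style branch predicts the noise on per-object shape codes conditioned on the same embeddings. This is only an illustrative schematic under assumed dimensions, not the CommonScenes architecture or code.
```python
import torch
import torch.nn as nn

class TwoBranchSceneGenerator(nn.Module):
    """Illustrative two-branch generator: VAE layout branch + diffusion-style shape branch."""

    def __init__(self, graph_dim=64, shape_dim=32):
        super().__init__()
        # Layout branch (VAE): graph embedding -> (mu, logvar) -> 7-DoF box per object.
        self.to_latent = nn.Linear(graph_dim, 2 * 16)
        self.layout_dec = nn.Linear(16, 7)            # x, y, z, w, h, d, yaw
        # Shape branch (denoiser): predicts the noise added to a per-object shape code.
        self.denoiser = nn.Sequential(
            nn.Linear(shape_dim + graph_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, shape_dim))

    def forward(self, node_emb, noisy_shape, t):
        mu, logvar = self.to_latent(node_emb).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        boxes = self.layout_dec(z)
        eps = self.denoiser(torch.cat([noisy_shape, node_emb, t], dim=-1))
        return boxes, eps

model = TwoBranchSceneGenerator()
node_emb = torch.randn(5, 64)                     # 5 objects from an (assumed) scene-graph encoder
boxes, eps = model(node_emb, torch.randn(5, 32), torch.full((5, 1), 0.3))
```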
arXiv Detail & Related papers (2023-05-25T17:39:13Z) - NSLF-OL: Online Learning of Neural Surface Light Fields alongside Real-time Incremental 3D Reconstruction [0.76146285961466]
The paper proposes a novel Neural Surface Light Fields model that copes with the small range of view directions while producing a good result in unseen directions.
Our model learns Neural Surface Light Fields (NSLF) online, alongside real-time 3D reconstruction, with a sequential data stream as the shared input.
In addition to online training, our model also provides real-time rendering after completing the data stream for visualization.
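The online-training pattern described above can be sketched as a single loop in which each incoming frame updates both a reconstruction module and a surface light-field model, with rendering available once the stream ends. The classes below are toy stubs introduced only to make the loop runnable; they do not reflect the NSLF-OL implementation.
```python
import numpy as np

class Reconstructor:
    """Stub for a real-time incremental reconstruction module (placeholder)."""
    def __init__(self):
        self.points = []
    def integrate(self, frame):
        self.points.append(frame["depth"])               # accumulate geometry
        return np.concatenate(self.points)

class SurfaceLightField:
    """Stub for a neural surface light field trained online (placeholder)."""
    def __init__(self):
        self.weights = np.zeros(3)
    def train_step(self, image, surface):
        self.weights += 0.01 * (image.mean(axis=(0, 1)) - self.weights)  # toy update
    def render(self, surface, pose):
        return self.weights                              # toy view-dependent colour

def run_online(stream, render_poses):
    recon, nslf = Reconstructor(), SurfaceLightField()
    for frame in stream:                                 # shared sequential input
        surface = recon.integrate(frame)                 # real-time reconstruction
        nslf.train_step(frame["image"], surface)         # online light-field update
    return [nslf.render(surface, p) for p in render_poses]  # render after the stream ends

stream = [{"depth": np.random.rand(10, 3), "image": np.random.rand(4, 4, 3)} for _ in range(5)]
views = run_online(stream, render_poses=[np.eye(4)])
```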
arXiv Detail & Related papers (2023-04-29T15:41:15Z) - Diffusion-based Generation, Optimization, and Planning in 3D Scenes [89.63179422011254]
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding.
SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented.
We show significant improvements compared with previous models.
arXiv Detail & Related papers (2023-01-15T03:43:45Z) - GAUDI: A Neural Architect for Immersive 3D Scene Generation [67.97817314857917]
GAUDI is a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera.
We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets.
arXiv Detail & Related papers (2022-07-27T19:10:32Z) - Towards 3D Scene Understanding by Referring Synthetic Models [65.74211112607315]
Existing methods typically rely on labour-extensive annotations of real scene scans.
We explore how labelled synthetic models can alleviate this burden by aligning real scene categories and synthetic features in a unified feature space.
Experiments show that our method achieves an average mAP of 46.08% on ScanNet and 55.49% on S3DIS by learning from synthetic datasets.
arXiv Detail & Related papers (2022-03-20T13:06:15Z) - RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
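A minimal, hypothetical sketch of the underlying idea: compose pseudo-scenes by dropping randomly chosen synthetic shapes at random non-overlapping positions, producing point clouds with instance labels for pre-training. The asset bank, scales, and rejection-sampling heuristic are illustrative assumptions, not the RandomRooms pipeline.
```python
import numpy as np

def random_room(asset_bank, n_objects=8, room_size=6.0, seed=None):
    """Compose a pseudo-scene from randomly chosen synthetic shapes placed at
    random, non-overlapping 2D positions (toy sketch)."""
    rng = np.random.default_rng(seed)
    placed, points, labels = [], [], []
    for i in range(n_objects):
        shape = asset_bank[rng.integers(len(asset_bank))]      # (N, 3) point cloud
        for _ in range(50):                                    # rejection sampling
            xy = rng.uniform(0.5, room_size - 0.5, size=2)
            if all(np.linalg.norm(xy - q) > 1.0 for q in placed):
                placed.append(xy)
                points.append(shape + np.array([xy[0], xy[1], 0.0]))
                labels.append(np.full(len(shape), i))
                break
    return np.concatenate(points), np.concatenate(labels)

# Usage with a hypothetical bank of unit-scale CAD point clouds.
bank = [np.random.rand(100, 3) - 0.5 for _ in range(20)]
pts, inst = random_room(bank, seed=0)
```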
arXiv Detail & Related papers (2021-08-17T17:56:12Z) - Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation [88.04759848307687]
In Meta-Sim2, we aim to learn the scene structure in addition to parameters, which is a challenging problem due to its discrete nature.
We use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training.
We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods.
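The training signal described above can be sketched as a REINFORCE-style update: discrete scene-structure choices are sampled from a policy, and the reward is a negative feature-space divergence between synthesized and target image features. The renderer, feature extractor, and dimensions below are placeholders, not Meta-Sim2's actual components.
```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))  # 10 object types
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def render_features(object_type):          # placeholder for renderer + feature extractor
    return torch.randn(128) + 0.1 * object_type

target_features = torch.randn(128)         # features of real target images (placeholder)

for step in range(100):
    ctx = torch.randn(1, 16)                              # scene context
    dist = torch.distributions.Categorical(logits=policy(ctx))
    choice = dist.sample()                                # discrete structure decision
    reward = -torch.norm(render_features(choice.item()) - target_features)
    loss = -(dist.log_prob(choice) * reward).mean()       # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```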
arXiv Detail & Related papers (2020-08-20T17:28:45Z) - Stillleben: Realistic Scene Synthesis for Deep Learning in Robotics [33.30312206728974]
We describe a synthesis pipeline capable of producing training data for cluttered scene perception tasks.
Our approach arranges object meshes in physically realistic, dense scenes using physics simulation.
Our pipeline can be run online during training of a deep neural network.
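As a rough sketch of such a physics-based arrangement step, the snippet below drops simple boxes (standing in for object meshes) into a headless PyBullet simulation and reads back their settled poses; it illustrates the general pattern only and is not the Stillleben pipeline.
```python
import random
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                   # headless simulation
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")

bodies = []
for _ in range(15):                                   # drop objects to form a dense pile
    col = p.createCollisionShape(p.GEOM_BOX, halfExtents=[0.03, 0.03, 0.03])
    body = p.createMultiBody(baseMass=0.1, baseCollisionShapeIndex=col,
                             basePosition=[random.uniform(-0.1, 0.1),
                                           random.uniform(-0.1, 0.1),
                                           0.3 + 0.1 * len(bodies)])
    bodies.append(body)

for _ in range(500):                                  # settle into a physically plausible arrangement
    p.stepSimulation()

poses = [p.getBasePositionAndOrientation(b) for b in bodies]  # final resting poses
p.disconnect()
```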
arXiv Detail & Related papers (2020-05-12T10:11:00Z)