SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
- URL: http://arxiv.org/abs/2602.10116v1
- Date: Tue, 10 Feb 2026 18:59:55 GMT
- Title: SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
- Authors: Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, Fangyin Wei
- Abstract summary: Existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task, understands the intent and automatically generates simulation-ready environments at scale. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training.
- Score: 67.43935343696982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: https://nvlabs.github.io/sage.
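To make the generator-critic loop described in the abstract concrete, here is a minimal toy sketch of that control flow. Every class and function name below is a hypothetical illustration, not the released SAGE API.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    """Toy stand-in for a simulator-ready scene description."""
    objects: list = field(default_factory=list)
    layout: dict = field(default_factory=dict)


def generate_scene(task: str) -> Scene:
    """Hypothetical generator: compose objects and layout for the task."""
    return Scene(objects=["table", "bowl"], layout={"bowl": "on_floor"})


def run_critics(scene: Scene) -> dict:
    """Hypothetical critics scoring plausibility and stability in [0, 1]."""
    return {
        "semantic": 1.0 if "bowl" in scene.objects else 0.0,
        "stability": 1.0 if scene.layout.get("bowl") == "on_table" else 0.3,
    }


def refine(scene: Scene, scores: dict) -> Scene:
    """Hypothetical targeted fix driven by the weakest critic."""
    if scores["stability"] < 0.5:
        scene.layout["bowl"] = "on_table"  # re-place the unstable object
    return scene


def self_refine(task: str, threshold: float = 0.9, max_iters: int = 5) -> Scene:
    scene = generate_scene(task)
    for _ in range(max_iters):
        scores = run_critics(scene)
        if min(scores.values()) >= threshold:
            break  # every critic is satisfied: accept the scene
        scene = refine(scene, scores)
    return scene


print(self_refine("pick up a bowl and place it on the table").layout)
```

The key design point mirrored here is that critics gate termination: the loop exits only once every score clears the threshold or the iteration budget runs out.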
Related papers
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
- Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions [27.247431258140463]
We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing.
arXiv Detail & Related papers (2025-11-06T18:52:08Z)
- SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent [28.12183839499528]
SceneWeaver is a framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. It can identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. It generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
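The "invoke targeted tools" step can be pictured as a small dispatch table from critic-flagged issues to repair tools. The sketch below is illustrative only; the tool names are made up and are not SceneWeaver's actual toolset.

```python
def move_onto_table(scene):      # hypothetical layout-repair tool
    scene["bowl_pose"] = "on_table"

def swap_asset(scene):           # hypothetical asset-replacement tool
    scene["bowl_asset"] = "bowl_v2"

TOOLS = {"bad_pose": move_onto_table, "bad_asset": swap_asset}

def find_issue(scene):
    """Hypothetical critic: return an issue tag, or None if the scene is fine."""
    return "bad_pose" if scene["bowl_pose"] == "on_floor" else None

scene = {"bowl_pose": "on_floor", "bowl_asset": "bowl_v1"}
while (issue := find_issue(scene)) is not None:
    TOOLS[issue](scene)          # invoke the targeted tool for this issue
print(scene)
```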
arXiv Detail & Related papers (2025-09-24T09:06:41Z)
- ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting [54.92763171355442]
ObjectGS is an object-aware framework that unifies 3D scene reconstruction with semantic understanding. We show through experiments that ObjectGS outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks.
arXiv Detail & Related papers (2025-07-21T10:06:23Z)
- InfGen: Scenario Generation as Next Token Group Prediction [49.54222089551598]
InfGen is a scenario generation framework that outputs agent states and trajectories in an autoregressive manner. Experiments demonstrate that InfGen produces realistic, diverse, and adaptive traffic behaviors.
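A toy sketch of "next token group" decoding as the summary describes it: each autoregressive step emits a whole group of tokens (one agent state) conditioned on all previously emitted groups. The stub model below is hypothetical, not InfGen itself.

```python
import random

GROUP = ("x", "y", "heading")   # one agent state = one group of discrete tokens

def next_group(prefix):
    """Stub model: sample the next agent-state token group given the prefix."""
    return {k: random.randint(0, 255) for k in GROUP}

def rollout(steps=4):
    seq = []
    for _ in range(steps):
        seq.append(next_group(seq))  # condition on all previous groups
    return seq

print(rollout())
```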
arXiv Detail & Related papers (2025-06-29T16:18:32Z)
- Steerable Scene Generation with Post Training and Inference-Time Search [21.854690970995648]
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. We generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation. We release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
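An SE(3) scene here is a set of object poses, each a rotation plus a translation. The snippet below sketches sampling one such pose procedurally; it is an assumption-laden illustration, not the paper's generator.

```python
import numpy as np

def random_se3(rng):
    """Sample a 4x4 homogeneous transform (yaw-only rotation, for brevity)."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]    # rotation about z
    T[:3, 3] = rng.uniform([-1.0, -1.0, 0.0], [1.0, 1.0, 0.5])  # translation
    return T

rng = np.random.default_rng(0)
print(random_se3(rng))
```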
arXiv Detail & Related papers (2025-05-07T22:07:42Z)
- Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR).
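DR perturbs nuisance visual factors (lighting, textures, camera parameters) every episode so the policy cannot latch onto any single rendering. A minimal runnable sketch, with a stand-in object in place of a real simulator API:

```python
import random
from dataclasses import dataclass, field

@dataclass
class SimVisuals:
    """Stand-in for a simulator's visual settings (hypothetical fields)."""
    light: float = 1.0
    textures: dict = field(default_factory=dict)

def randomize_visuals(sim, rng):
    """Perturb nuisance visual factors so the policy cannot overfit to them."""
    sim.light = rng.uniform(0.3, 1.5)
    sim.textures["table"] = rng.choice(["wood", "marble", "plastic"])

rng = random.Random(0)
sim = SimVisuals()
for episode in range(3):
    randomize_visuals(sim, rng)   # a fresh visual domain every episode
    print(episode, sim)           # a real loop would collect rollouts here
```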
arXiv Detail & Related papers (2023-07-28T05:47:24Z)
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments to automatically generate potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
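Such a parametric description might look like a small set of knobs that directly set the difficulty of the perceptual stream; the fields below are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SceneParams:
    """Illustrative complexity knobs, not the paper's actual schema."""
    n_objects: int = 5           # scene clutter
    texture_detail: float = 0.5  # 0 = flat colors, 1 = full textures
    motion_speed: float = 0.0    # 0 = static scene, > 0 = dynamic objects

easy = SceneParams(n_objects=2, texture_detail=0.1)
hard = SceneParams(n_objects=20, texture_detail=1.0, motion_speed=1.0)
print(easy, hard, sep="\n")
```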
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
- SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors [3.1969855247377827]
We introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes.
SceneGen takes a semantically segmented scene as input, and outputs positional and orientational probability maps for placing virtual content.
We formulate a novel spatial Scene Graph representation, which encapsulates explicit topological properties between objects, object groups, and the room.
To demonstrate our system in action, we develop an Augmented Reality application, in which objects can be contextually augmented in real-time.
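Placing an object from a positional probability map reduces to sampling a cell from a normalized 2D distribution; the sketch below uses a synthetic map rather than SceneGen's predicted one.

```python
import numpy as np

rng = np.random.default_rng(0)
prob = rng.random((8, 8))            # synthetic stand-in for a position map
prob /= prob.sum()                   # normalize into a distribution over cells

flat = rng.choice(prob.size, p=prob.ravel())
row, col = np.unravel_index(flat, prob.shape)
yaw = rng.uniform(0.0, 2.0 * np.pi)  # orientation could use its own map
print(f"place object at cell ({row}, {col}) with yaw {yaw:.2f} rad")
```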
arXiv Detail & Related papers (2020-09-25T18:36:27Z)