SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
- URL: http://arxiv.org/abs/2509.20414v2
- Date: Sun, 26 Oct 2025 04:10:24 GMT
- Title: SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
- Authors: Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang
- Abstract summary: SceneWeaver is a framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. It can identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. It generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
- Score: 28.12183839499528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.
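The closed-loop reason-act-reflect design reads naturally as pseudocode. Below is a minimal sketch of that cycle as the abstract describes it; the tool names, the scoring stub, and the stopping threshold are illustrative assumptions for this sketch, not SceneWeaver's actual interface.

```python
# Minimal sketch of a reason-act-reflect loop with tool selection.
# Tool names, the scoring rubric, and the threshold are assumptions,
# not SceneWeaver's published API.
from dataclasses import dataclass, field

@dataclass
class Scene:
    objects: list = field(default_factory=list)
    history: list = field(default_factory=list)

def evaluate(scene, instruction):
    """Self-evaluation stub: score physical plausibility, visual
    realism, and semantic alignment with the instruction in [0, 1]."""
    return {"physical": 0.9, "visual": 0.8, "semantic": 0.7}  # placeholder

TOOLS = {
    "generative_layout": lambda s, i: s,   # data-driven generative model
    "visual_refine": lambda s, i: s,       # vision-based refinement
    "llm_edit": lambda s, i: s,            # LLM-based object-level edit
}

def plan_next_tool(scores):
    """Planner stub: pick the tool targeting the weakest criterion.
    A real planner would be an LLM reasoning over scores and history."""
    worst = min(scores, key=scores.get)
    return {"physical": "generative_layout",
            "visual": "visual_refine",
            "semantic": "llm_edit"}[worst]

def synthesize(instruction, max_iters=5, threshold=0.85):
    scene = Scene()
    for _ in range(max_iters):
        scores = evaluate(scene, instruction)          # reflect
        if min(scores.values()) >= threshold:
            break                                      # all criteria pass
        tool = plan_next_tool(scores)                  # reason
        scene = TOOLS[tool](scene, instruction)        # act
        scene.history.append((tool, scores))
    return scene
```

One appeal of this structure is extensibility: adding a new generator only means registering one more entry in the tool table, which matches the abstract's claim of an extensible tool suite.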
Related papers
- SAGE: Scalable Agentic 3D Scene Generation for Embodied AI [67.43935343696982]
Existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task, understands the intent and automatically generates simulation-ready environments at scale. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training.
arXiv Detail & Related papers (2026-02-10T18:59:55Z)
- IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion [15.837932667195037]
IGFuse is a novel framework that reconstructs interactive Gaussian scenes by fusing observations from multiple scans. The method constructs segmentation-aware Gaussian fields and enforces bi-directional photometric and semantic consistency across scans (a toy sketch of this consistency term follows this entry). IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex pipelines.
arXiv Detail & Related papers (2025-08-18T17:59:47Z)
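For intuition, here is a toy rendering-consistency term of the kind the summary describes: each scan's Gaussian field is rendered from the other scan's viewpoints, and the two renders are pushed to agree photometrically and semantically. The `render` interface and the loss weighting are assumptions for illustration, not IGFuse's implementation.

```python
# Hedged sketch of a bi-directional consistency loss between two
# Gaussian fields built from different scans. `render` is a placeholder
# for a differentiable Gaussian-splatting renderer.
import torch
import torch.nn.functional as F

def bidirectional_consistency(render, field_a, field_b, views_a, views_b,
                              lam_sem=0.1):
    """render(field, view) -> (rgb[H,W,3], sem_logits[H,W,C])."""
    loss = 0.0
    for src, dst, views in [(field_a, field_b, views_b),
                            (field_b, field_a, views_a)]:
        for v in views:
            rgb_src, sem_src = render(src, v)
            rgb_dst, sem_dst = render(dst, v)
            loss = loss + F.l1_loss(rgb_src, rgb_dst)   # photometric term
            loss = loss + lam_sem * F.kl_div(           # semantic term
                F.log_softmax(sem_src, dim=-1),
                F.softmax(sem_dst, dim=-1),
                reduction="batchmean")
    return loss
```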
- ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting [54.92763171355442]
ObjectGS is an object-aware framework that unifies 3D scene reconstruction with semantic understanding. We show through experiments that ObjectGS outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks.
arXiv Detail & Related papers (2025-07-21T10:06:23Z)
- Video Perception Models for 3D Scene Synthesis [109.5543506037003]
VIPScene is a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models. VIPScene seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene (a rough pipeline sketch follows this entry).
arXiv Detail & Related papers (2025-06-25T16:40:17Z)
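The summary describes a three-stage pipeline; a rough structural sketch is below. Every stage is a dummy placeholder standing in for a real model, and none of these interfaces are VIPScene's actual code.

```python
# Structural sketch of a video-prior scene-synthesis pipeline:
# generate a video, lift it to 3D, then name each object. All stages
# are dummy placeholders assumed for illustration.
from typing import Any, Dict, List

def generate_video(prompt: str) -> List[str]:
    """Stage 1: a text-to-video model supplies frames whose content
    encodes commonsense priors about the 3D physical world."""
    return [f"frame_{i}" for i in range(8)]            # dummy frames

def reconstruct_3d(frames: List[str]) -> Dict[str, Any]:
    """Stage 2: feedforward reconstruction lifts frames to geometry."""
    return {"points": [], "cameras": len(frames)}      # dummy geometry

def perceive_objects(frames, geometry) -> List[Dict[str, Any]]:
    """Stage 3: open-vocabulary perception names each object and ties
    it to its reconstructed geometry."""
    return [{"label": "sofa", "geometry": None}]       # dummy detection

def synthesize_scene(prompt: str) -> List[Dict[str, Any]]:
    frames = generate_video(prompt)                    # video prior
    geometry = reconstruct_3d(frames)                  # 3D structure
    return perceive_objects(frames, geometry)          # object semantics

print(synthesize_scene("a cozy living room"))
```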
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation [36.44409268300039]
Scenethesis is a framework that integrates text-based scene planning with vision-guided layout refinement. It generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
arXiv Detail & Related papers (2025-05-05T17:59:58Z)
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior [27.773451301040424]
InstructScene is a novel generative framework that integrates a semantic graph prior and a layout decoder (an illustrative scene-graph structure follows this entry). We show that the proposed method surpasses existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2024-02-07T10:09:00Z)
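To make the "semantic graph prior" concrete, here is an illustrative scene-graph data structure with a placeholder layout decoder. The field names and the hand-built example graph are assumptions for this sketch; InstructScene learns such a graph generatively from instructions rather than taking one as hand-authored input.

```python
# Illustrative semantic scene graph: nodes are objects with category
# and appearance attributes, edges are spatial/semantic relations.
# The decoder interface is an assumption, not InstructScene's code.
from dataclasses import dataclass

@dataclass
class Node:
    category: str          # e.g. "bed", "nightstand"
    description: str = ""  # free-text appearance attribute

@dataclass
class Edge:
    subject: int           # index into nodes
    relation: str          # e.g. "left of", "on top of"
    object: int

@dataclass
class SceneGraph:
    nodes: list
    edges: list

graph = SceneGraph(
    nodes=[Node("bed", "queen-size, wooden frame"),
           Node("nightstand"), Node("lamp")],
    edges=[Edge(1, "left of", 0), Edge(2, "on top of", 1)],
)

def decode_layout(graph: SceneGraph):
    """Placeholder layout decoder: map each node to a pose (position,
    size, orientation) consistent with the relational edges."""
    return [{"category": n.category, "pose": None} for n in graph.nodes]

print(decode_layout(graph))
```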
- Style-Consistent 3D Indoor Scene Synthesis with Decoupled Objects [84.45345829270626]
Controllable 3D indoor scene synthesis stands at the forefront of technological progress. Current methods for scene stylization are limited to applying styles to the entire scene. We introduce a unique pipeline designed for synthesizing 3D indoor scenes with decoupled objects.
arXiv Detail & Related papers (2024-01-24T03:10:36Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes (a toy two-branch sketch follows this entry). The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
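A toy sketch of the two-branch idea described above: a VAE head maps per-object graph features to layout boxes, while a diffusion-style denoiser predicts noise on per-object shape latents conditioned on the same features. All dimensions and module choices are illustrative assumptions, not the paper's architecture.

```python
# Two-branch sketch: VAE for layout, diffusion-style denoiser for
# shapes, both conditioned on per-object scene-graph features.
# Dimensions and modules are assumptions for illustration.
import torch
import torch.nn as nn

class LayoutVAE(nn.Module):
    def __init__(self, graph_dim=128, latent_dim=32, box_dim=7):
        super().__init__()
        self.enc = nn.Linear(graph_dim, 2 * latent_dim)
        self.dec = nn.Linear(latent_dim, box_dim)  # box: xyz, size, yaw

    def forward(self, graph_feat):                 # [N, graph_dim]
        mu, logvar = self.enc(graph_feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

class ShapeDiffusion(nn.Module):
    """Stub denoiser: predicts noise on a per-object shape latent,
    conditioned on that object's graph feature and a timestep."""
    def __init__(self, shape_dim=64, graph_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(shape_dim + graph_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, shape_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

graph_feat = torch.randn(5, 128)                   # 5 objects in the graph
boxes, mu, logvar = LayoutVAE()(graph_feat)        # branch 1: layout
x_t, t = torch.randn(5, 64), torch.rand(5, 1)
eps_hat = ShapeDiffusion()(x_t, t, graph_feat)     # branch 2: shapes
```

Editing the input graph changes the conditioning features, and resampling the diffusion noise varies the shapes, which is how the summary's scene manipulation would surface in a design like this.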