RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
- URL: http://arxiv.org/abs/2512.11234v1
- Date: Fri, 12 Dec 2025 02:33:09 GMT
- Title: RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
- Authors: Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li
- Abstract summary: RoomPilot is a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors.
- Score: 8.822704029209593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
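The abstract's key idea is that a domain-specific language can serve as a shared semantic representation between input modalities and the scene synthesizer. The paper does not publish the IDSL grammar, so the following is a minimal hypothetical sketch of what such a representation could look like in Python: rooms containing objects, each object optionally annotated with interaction behaviors (the `SceneObject`, `Room`, and `interactive_objects` names are illustrative, not from the paper).

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Indoor Domain-Specific Language (IDSL) node
# structure; all names and fields here are illustrative assumptions.

@dataclass
class SceneObject:
    category: str                 # e.g. "cabinet", "lamp"
    position: tuple               # (x, y) placement in metres
    interactions: list = field(default_factory=list)  # e.g. ["open", "close"]

@dataclass
class Room:
    name: str
    size: tuple                   # (width, depth) in metres
    objects: list = field(default_factory=list)

    def interactive_objects(self):
        """Return objects annotated with at least one interaction behavior."""
        return [o for o in self.objects if o.interactions]

# A tiny scene: a kitchen with one interactive cabinet and one static table.
kitchen = Room(
    name="kitchen",
    size=(4.0, 3.0),
    objects=[
        SceneObject("cabinet", (0.5, 0.2), interactions=["open", "close"]),
        SceneObject("table", (2.0, 1.5)),
    ],
)

print([o.category for o in kitchen.interactive_objects()])  # → ['cabinet']
```

A representation along these lines would let a text parser and a CAD-floor-plan parser target the same intermediate structure, which is the "shared semantic representation" role the abstract assigns to the IDSL.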
Related papers
- PAct: Part-Decomposed Single-View Articulated Object Generation [45.04652409374895]
Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR.
We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning.
Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues.
arXiv Detail & Related papers (2026-02-16T17:45:44Z) - SceneFoundry: Generating Interactive Infinite 3D Worlds [22.60801815197924]
SceneFoundry is a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture.
Our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions.
arXiv Detail & Related papers (2026-01-09T14:33:10Z) - REACT3D: Recovering Articulations for Interactive Physical 3D Scenes [96.27769519526426]
REACT3D is a framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry.
We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes.
arXiv Detail & Related papers (2025-10-13T12:37:59Z) - RoomCraft: Controllable and Complete 3D Indoor Scene Generation [51.19602078504066]
RoomCraft is a multi-stage pipeline that converts real images, sketches, or text descriptions into coherent 3D indoor scenes.
Our approach combines a scene generation pipeline with a constraint-driven optimization framework.
RoomCraft significantly outperforms existing methods in generating realistic, semantically coherent, and visually appealing room layouts.
arXiv Detail & Related papers (2025-06-27T15:03:17Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.
Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.
Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception.
Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention.
This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments.
We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context.
Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - ASSIST: Interactive Scene Nodes for Scalable and Realistic Indoor Simulation [17.34617771579733]
We present ASSIST, an object-wise neural radiance field as a panoptic representation for compositional and realistic simulation.
A novel scene node data structure that stores the information of each object in a unified fashion allows online interaction in both intra- and cross-scene settings.
arXiv Detail & Related papers (2023-11-10T17:56:43Z) - Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation [38.08152990071453]
We introduce a Multi-granularity Interaction Simulation (MIS) approach to open up a promising direction for unsupervised interactive segmentation.
Our MIS significantly outperforms non-deep learning unsupervised methods and is even comparable with some previous deep-supervised methods without any annotation.
arXiv Detail & Related papers (2023-03-23T16:19:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.