RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
- URL: http://arxiv.org/abs/2512.11234v1
- Date: Fri, 12 Dec 2025 02:33:09 GMT
- Title: RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
- Authors: Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li
- Abstract summary: RoomPilot is a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors.
- Score: 8.822704029209593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
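The abstract's key idea is that a domain-specific language can serve as a shared semantic representation between input modalities and the scene synthesizer. The paper does not publish the IDSL grammar, so the following is a minimal hypothetical sketch of what such a representation could look like in Python: rooms containing objects, each object optionally annotated with interaction behaviors (the `SceneObject`, `Room`, and `interactive_objects` names are illustrative, not from the paper).

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Indoor Domain-Specific Language (IDSL) node
# structure; all names and fields here are illustrative assumptions.

@dataclass
class SceneObject:
    category: str                 # e.g. "cabinet", "lamp"
    position: tuple               # (x, y) placement in metres
    interactions: list = field(default_factory=list)  # e.g. ["open", "close"]

@dataclass
class Room:
    name: str
    size: tuple                   # (width, depth) in metres
    objects: list = field(default_factory=list)

    def interactive_objects(self):
        """Return objects annotated with at least one interaction behavior."""
        return [o for o in self.objects if o.interactions]

# A tiny scene: a kitchen with one interactive cabinet and one static table.
kitchen = Room(
    name="kitchen",
    size=(4.0, 3.0),
    objects=[
        SceneObject("cabinet", (0.5, 0.2), interactions=["open", "close"]),
        SceneObject("table", (2.0, 1.5)),
    ],
)

print([o.category for o in kitchen.interactive_objects()])  # → ['cabinet']
```

A representation along these lines would let a text parser and a CAD-floor-plan parser target the same intermediate structure, which is the "shared semantic representation" role the abstract assigns to the IDSL.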
Related papers
- PAct: Part-Decomposed Single-View Articulated Object Generation [45.04652409374895]
Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR.
We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning.
Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues.
arXiv Detail & Related papers (2026-02-16T17:45:44Z) - SceneFoundry: Generating Interactive Infinite 3D Worlds [22.60801815197924]
SceneFoundry is a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture.
Our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions.
arXiv Detail & Related papers (2026-01-09T14:33:10Z) - REACT3D: Recovering Articulations for Interactive Physical 3D Scenes [96.27769519526426]
REACT3D is a framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry.
We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes.
arXiv Detail & Related papers (2025-10-13T12:37:59Z) - RoomCraft: Controllable and Complete 3D Indoor Scene Generation [51.19602078504066]
RoomCraft is a multi-stage pipeline that converts real images, sketches, or text descriptions into coherent 3D indoor scenes.
Our approach combines a scene generation pipeline with a constraint-driven optimization framework.
RoomCraft significantly outperforms existing methods in generating realistic, semantically coherent, and visually appealing room layouts.
arXiv Detail & Related papers (2025-06-27T15:03:17Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.
Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.
Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception.
Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention.
This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments.
We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context.
Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - ASSIST: Interactive Scene Nodes for Scalable and Realistic Indoor Simulation [17.34617771579733]
We present ASSIST, an object-wise neural radiance field as a panoptic representation for compositional and realistic simulation.
A novel scene node data structure that stores the information of each object in a unified fashion allows online interaction in both intra- and cross-scene settings.
arXiv Detail & Related papers (2023-11-10T17:56:43Z) - Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation [38.08152990071453]
We introduce a Multi-granularity Interaction Simulation (MIS) approach to open up a promising direction for unsupervised interactive segmentation.
Our MIS significantly outperforms non-deep learning unsupervised methods and is even comparable with some previous deep-supervised methods without any annotation.
arXiv Detail & Related papers (2023-03-23T16:19:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.