Related papers: SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

URL: http://arxiv.org/abs/2601.07218v1
Date: Mon, 12 Jan 2026 05:24:27 GMT
Title: SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis
Authors: Jeongjun Choi, Yeonsoo Park, H. Jin Kim
Abstract summary: SceneNAT is a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. Experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines.
Score: 20.364473705766326
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
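The "few parallel decoding passes" mentioned in the abstract can be illustrated with a minimal sketch of confidence-based iterative masked decoding. The abstract does not specify the decoding schedule, so the cosine schedule, the `MASK` sentinel, the `toy_predictor` stand-in for the Transformer, and the function name `parallel_decode` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a still-masked attribute token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the Transformer (assumption): returns a (token, confidence)
    guess for every position. A real model would condition on the text prompt
    and on the tokens that are already unmasked."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(num_tokens, vocab_size, num_passes=4, seed=0,
                    predictor=toy_predictor):
    """Fill a fully masked sequence in at most `num_passes` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, num_passes + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass predicts every position at once.
        preds = predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many tokens stay masked after this pass.
        keep_masked = math.floor(
            num_tokens * math.cos(math.pi / 2 * step / num_passes))
        if step == num_passes:
            keep_masked = 0  # the last pass commits everything that is left
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # 12 discretized attribute tokens, 8 possible values each (toy sizes).
    print(parallel_decode(num_tokens=12, vocab_size=8, num_passes=4))
```

In contrast to autoregressive decoding, which needs one model call per token, the number of model calls here is bounded by `num_passes` regardless of sequence length.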

Related papers

URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z)
CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D [0.0]
3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs). We leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks.
arXiv Detail & Related papers (2025-09-29T09:43:00Z)
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Scenes [30.417675568919552]
Large-scale 3D semantic scene generation has predominantly relied on voxel-based representations. Primitives represent semantic entities using compact, coarse 3D structures that are easy to manipulate and compose. PrITTI is a latent diffusion-based framework that leverages primitives as the main foundational elements for generating compositional, controllable, and editable scene layouts.
arXiv Detail & Related papers (2025-06-23T20:47:18Z)
CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design [35.11283253765395]
We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes.
arXiv Detail & Related papers (2025-04-28T04:35:04Z)
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation. It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns. A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
Mixed Diffusion for 3D Indoor Scene Synthesis [55.94569112629208]
We present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes. We show it outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
arXiv Detail & Related papers (2024-05-31T17:54:52Z)
DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling [23.06464506261766]
We present DreamScape, a method for generating 3D scenes from text. We use a 3D Gaussian Guide that encodes semantic primitives, spatial transformations and relationships from text using LLMs. DreamScape achieves state-of-the-art performance, enabling high-fidelity, controllable 3D scene generation.
arXiv Detail & Related papers (2024-04-14T12:13:07Z)
S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes. Our method combines the strength of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.