HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild
- URL: http://arxiv.org/abs/2512.02450v1
- Date: Tue, 02 Dec 2025 06:18:08 GMT
- Title: HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild
- Authors: Valentin Bieri, Marie-Julie Rakotosaona, Keisuke Tateno, Francis Engelmann, Leonidas Guibas,
- Abstract summary: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single-room or single-floor environments. We introduce HouseLayout3D, a real-world benchmark designed to support progress toward full building-scale layout estimation.
- Score: 17.263404264950257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single-room or single-floor environments. As a consequence, they cannot natively handle large multi-floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real-world benchmark designed to support progress toward full building-scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training-free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.
Related papers
- PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding [67.15800065888887]
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. We introduce an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering.
arXiv Detail & Related papers (2026-01-05T18:55:45Z) - M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation [14.956470298543534]
In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed output. We introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating real-world scans, professional CAD designs, and procedurally generated scenes.
arXiv Detail & Related papers (2025-09-28T08:16:08Z) - SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce SURPRISE3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. SURPRISE3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z) - Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning [30.400355766831307]
3D indoor scene synthesis is vital for embodied AI and digital content creation. Existing methods fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce Direct, a framework that directly generates numerical 3D layouts from text descriptions.
arXiv Detail & Related papers (2025-06-05T17:59:42Z) - SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining [100.23919762298227]
Currently, all existing methods rely on 2D or textual modalities during training or at inference. We introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates on 3DGS. We propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes.
arXiv Detail & Related papers (2025-03-23T12:50:25Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - SGAligner : 3D Scene Alignment with Scene Graphs [84.01002998166145]
Building 3D scene graphs has emerged as a topic in scene representation for several embodied AI applications.
We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial.
We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios.
arXiv Detail & Related papers (2023-04-28T14:39:22Z) - Neural 3D Scene Reconstruction with the Manhattan-world Assumption [58.90559966227361]
This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images.
Planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods.
The proposed method outperforms previous methods by a large margin on 3D reconstruction quality.
arXiv Detail & Related papers (2022-05-05T17:59:55Z) - Roominoes: Generating Novel 3D Floor Plans From Existing 3D Rooms [22.188206636953794]
We propose the task of generating novel 3D floor plans from existing 3D rooms.
One uses available 2D floor plans to guide selection and deformation of 3D rooms; the other learns to retrieve a set of compatible 3D rooms and combine them into novel layouts.
arXiv Detail & Related papers (2021-12-10T16:17:01Z) - Learning Compositional Shape Priors for Few-Shot 3D Reconstruction [36.40776735291117]
We show that complex encoder-decoder architectures exploit large amounts of per-category data.
We propose three ways to learn a class-specific global shape prior, directly from data.
Experiments on the popular ShapeNet dataset show that our method outperforms a zero-shot baseline by over 40%.
arXiv Detail & Related papers (2021-06-11T14:55:49Z) - Scan2Plan: Efficient Floorplan Generation from 3D Scans of Indoor Scenes [9.71137838903781]
Scan2Plan is a novel approach for accurate estimation of a floorplan from a 3D scan of the structural elements of indoor environments.
The proposed method incorporates a two-stage approach where the initial stage clusters an unordered point cloud representation of the scene.
The subsequent stage estimates a closed perimeter, parameterized by a simple polygon, for each individual room.
The final floorplan is simply an assembly of all such room perimeters in the global coordinate system.
arXiv Detail & Related papers (2020-03-16T17:59:41Z)
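The Scan2Plan entry above describes a two-stage pipeline: cluster an unordered point cloud into rooms, estimate a closed simple-polygon perimeter per room, then assemble the perimeters into a floorplan. A minimal illustrative sketch of that idea (not the authors' code; the distance-threshold clustering and convex-hull perimeters are simplifying assumptions, as real rooms need non-convex polygons):

```python
# Toy Scan2Plan-style pipeline on 2D points (assumed stand-ins for projected
# structural scan points). Stage 1: single-linkage clustering with threshold
# eps groups points into rooms. Stage 2: each room's closed perimeter is
# approximated by its convex hull. The floorplan is the list of room polygons
# in the shared global coordinate frame.
from math import hypot

def cluster_points(points, eps=1.0):
    """Union-find single-linkage clustering: points within eps share a room."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if hypot(xi - xj, yi - yj) <= eps:
                parent[find(i)] = find(j)
    rooms = {}
    for i, p in enumerate(points):
        rooms.setdefault(find(i), []).append(p)
    return list(rooms.values())

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def floorplan(points, eps=1.0):
    """Assemble one closed perimeter polygon per detected room."""
    return [convex_hull(room) for room in cluster_points(points, eps)]
```

For example, two well-separated point clusters yield two four-vertex room polygons; the actual paper replaces these toy stages with a learned clustering and non-convex perimeter estimation.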
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.