Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
- URL: http://arxiv.org/abs/2507.06230v2
- Date: Fri, 25 Jul 2025 14:31:34 GMT
- Title: Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
- Authors: Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers
- Abstract summary: Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy.
- Score: 86.34232220368855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
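The abstract notes that linear probing the frozen 3D features matches a supervised SSC approach. As an illustrative sketch only (not the paper's code), a linear probe trains a single affine layer with softmax cross-entropy on top of frozen per-voxel features; all shapes and the random stand-in data below are assumptions:

```python
import numpy as np

# Hypothetical setup: N voxels with frozen D-dimensional SceneDINO-style
# features and C semantic classes (illustrative stand-in data only).
rng = np.random.default_rng(0)
N, D, C = 512, 64, 4
features = rng.normal(size=(N, D))   # frozen features, never updated
labels = rng.integers(0, C, size=N)  # voxel-level semantic labels

# Linear probe: only the affine layer (W, b) is trained.
W = np.zeros((D, C))
b = np.zeros(C)
lr = 0.1
for _ in range(200):
    logits = features @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(N), labels] -= 1.0            # d(loss)/d(logits)
    grad /= N
    W -= lr * features.T @ grad
    b -= lr * grad.sum(axis=0)

accuracy = (np.argmax(features @ W + b, axis=1) == labels).mean()
```

Because the backbone features stay frozen, probe accuracy directly measures how linearly separable the learned representation already is, which is what makes it a fair comparison against supervised SSC methods.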
Related papers
- Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion [0.8669877024051931]
Monocular Indoor Semantic Scene Completion aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene. We introduce an innovative approach that leverages novel view synthesis and multiview fusion. We demonstrate IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset.
arXiv Detail & Related papers (2025-03-07T02:09:38Z)
- MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z)
- Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space [10.49905491984899]
We redefine the problem to segment the 3D volume and propose the following methods for better 3D understanding. We directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels. We transfer the learned language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy.
arXiv Detail & Related papers (2024-08-14T09:50:02Z)
- S4C: Self-Supervised Semantic Scene Completion with Neural Fields [54.35865716337547]
3D semantic scene understanding is a fundamental challenge in computer vision.
Current methods for SSC are generally trained on 3D ground truth based on aggregated LiDAR scans.
Our work presents the first self-supervised approach to SSC called S4C that does not rely on 3D ground truth data.
arXiv Detail & Related papers (2023-10-11T14:19:05Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art semantic scene completion performance on two large-scale benchmark datasets, Matterport3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Self-Supervised Image Representation Learning with Geometric Set Consistency [50.12720780102395]
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency.
Specifically, we introduce 3D geometric consistency into a contrastive learning framework to enforce the feature consistency within image views.
arXiv Detail & Related papers (2022-03-29T08:57:33Z)
- Depth Based Semantic Scene Completion with Position Importance Aware Loss [52.06051681324545]
PALNet is a novel hybrid network for semantic scene completion.
It extracts both 2D and 3D features at multiple stages using fine-grained depth information.
This is beneficial for recovering key details like the boundaries of objects and the corners of the scene.
arXiv Detail & Related papers (2020-01-29T07:05:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.