Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
- URL: http://arxiv.org/abs/2507.08555v1
- Date: Fri, 11 Jul 2025 12:57:14 GMT
- Title: Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
- Authors: Enyu Liu, En Yu, Sijia Chen, Wenbing Tao
- Abstract summary: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. We propose Disentangling Instance and Scene Contexts (DISC) to enhance learning for both instance and scene categories.
- Score: 23.76697700853566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which has proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
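The abstract describes replacing generic voxel queries with per-class queries that carry class-specific geometric and semantic priors. Below is a minimal PyTorch-style sketch of how such class queries might be assembled; the module and parameter names (ClassQueryGenerator, num_classes, embed_dim) and the way the geometric prior is modeled are illustrative assumptions, not the released DISC implementation.

```python
import torch
import torch.nn as nn

class ClassQueryGenerator(nn.Module):
    """Illustrative sketch: build one query per semantic class by combining
    a learned semantic embedding with a class-specific geometric prior.
    Assumption-based reconstruction, not the official DISC code."""

    def __init__(self, num_classes: int = 20, embed_dim: int = 128):
        super().__init__()
        # Learned semantic prior: one embedding vector per class.
        self.semantic_embed = nn.Embedding(num_classes, embed_dim)
        # Geometric prior: here a free learnable vector per class; a real
        # system might derive it from class-wise occupancy or height statistics.
        self.geometric_prior = nn.Parameter(torch.zeros(num_classes, embed_dim))
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, batch_size: int) -> torch.Tensor:
        idx = torch.arange(self.semantic_embed.num_embeddings,
                           device=self.semantic_embed.weight.device)
        sem = self.semantic_embed(idx)                                  # (C, D)
        queries = self.fuse(torch.cat([sem, self.geometric_prior], -1))  # (C, D)
        # One query set per sample: (B, C, D)
        return queries.unsqueeze(0).expand(batch_size, -1, -1)


if __name__ == "__main__":
    gen = ClassQueryGenerator(num_classes=20, embed_dim=128)
    print(gen(batch_size=2).shape)  # torch.Size([2, 20, 128])
```

In a full pipeline these class queries would then attend to image or voxel features through separate instance- and scene-oriented decoder branches, which is the dual-stream separation the abstract refers to.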
Related papers
- Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z)
- BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis [33.53327976669034]
We revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. We introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features.
arXiv Detail & Related papers (2025-03-16T15:13:11Z)
- Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation [9.964615076037397]
Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping. Previous efforts have primarily focused on pixel-level static-dynamic context matching. This paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency framework.
arXiv Detail & Related papers (2024-12-11T02:29:51Z)
- Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation [40.49322398635262]
We propose the first method to tackle 3D open-vocabulary panoptic segmentation.
Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features.
We propose two novel loss functions: object-level distillation loss and voxel-level distillation loss.
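As a rough illustration of the voxel-level distillation idea mentioned in this entry, the sketch below aligns learnable voxel features with frozen CLIP-style features via a cosine-similarity loss; the function name voxel_distill_loss, the tensor shapes, and the masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def voxel_distill_loss(voxel_feats: torch.Tensor,
                       clip_feats: torch.Tensor,
                       valid_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical voxel-level distillation: pull each learnable voxel feature
    toward the frozen CLIP feature associated with the same voxel.
    voxel_feats: (N, D) features from the 3D backbone (trainable).
    clip_feats:  (N, D) frozen CLIP features projected onto the same voxels.
    valid_mask:  (N,) bool mask for voxels that received a CLIP feature."""
    v = F.normalize(voxel_feats[valid_mask], dim=-1)
    c = F.normalize(clip_feats[valid_mask], dim=-1)
    # 1 - cosine similarity, averaged over supervised voxels.
    return (1.0 - (v * c).sum(dim=-1)).mean()

# Example usage with random stand-in tensors:
feats = torch.randn(1000, 512, requires_grad=True)
clip = torch.randn(1000, 512)           # frozen, no gradient
mask = torch.rand(1000) > 0.5
voxel_distill_loss(feats, clip, mask).backward()
```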
arXiv Detail & Related papers (2024-01-04T18:39:32Z)
- SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
- A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors [10.497309421830671]
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner.
This work presents a general and simple framework to tackle point cloud understanding when labels are limited.
arXiv Detail & Related papers (2023-12-03T02:51:54Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation [49.56131393810713]
We present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner.
Our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs.
arXiv Detail & Related papers (2023-06-08T22:55:32Z)
- Segmenting 3D Hybrid Scenes via Zero-Shot Learning [13.161136148641813]
This work tackles the problem of point cloud semantic segmentation for 3D hybrid scenes under the framework of zero-shot learning.
We propose a network, called PFNet, that synthesizes point features for various classes of objects by leveraging the semantic features of both seen and unseen object classes.
The proposed PFNet employs a GAN architecture to synthesize point features, where the semantic relationship between seen-class and unseen-class features is consolidated by adapting a new semantic regularizer (a minimal sketch of this feature-synthesis step follows this entry).
We introduce two benchmarks for algorithmic evaluation by re-organizing the public S3DIS and ScanNet datasets under six different data splits.
arXiv Detail & Related papers (2021-07-01T13:21:49Z)
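The PFNet entry above describes synthesizing point features for unseen classes from semantic embeddings with a GAN. The generator sketch below maps a class's semantic vector plus noise to synthetic point features; all names and dimensions (SemanticFeatureGenerator, sem_dim, noise_dim, feat_dim) are illustrative assumptions rather than the published PFNet code.

```python
import torch
import torch.nn as nn

class SemanticFeatureGenerator(nn.Module):
    """Illustrative GAN generator: synthesize point features for a class from
    its semantic embedding (e.g. a word vector) plus Gaussian noise.
    Names and sizes are assumptions, not the published PFNet implementation."""

    def __init__(self, sem_dim: int = 300, noise_dim: int = 64, feat_dim: int = 256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, feat_dim),
        )

    def forward(self, sem: torch.Tensor, num_points: int) -> torch.Tensor:
        # sem: (sem_dim,) semantic embedding of one seen or unseen class.
        noise = torch.randn(num_points, self.noise_dim, device=sem.device)
        sem_rep = sem.unsqueeze(0).expand(num_points, -1)
        return self.net(torch.cat([sem_rep, noise], dim=-1))  # (num_points, feat_dim)


gen = SemanticFeatureGenerator()
unseen_class_vec = torch.randn(300)          # stand-in for a word embedding
fake_feats = gen(unseen_class_vec, num_points=128)
print(fake_feats.shape)                      # torch.Size([128, 256])
```

In a full zero-shot pipeline, a discriminator and a semantic regularizer of the kind described in the entry would constrain these synthesized features, and a segmentation classifier would then be trained on real seen-class features together with synthesized unseen-class features.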
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.