Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
- URL: http://arxiv.org/abs/2507.08555v1
- Date: Fri, 11 Jul 2025 12:57:14 GMT
- Title: Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
- Authors: Enyu Liu, En Yu, Sijia Chen, Wenbing Tao
- Abstract summary: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. We propose Disentangling Instance and Scene Contexts (DISC) to enhance learning for both instance and scene categories.
- Score: 23.76697700853566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which has proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
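The abstract describes replacing generic voxel queries with per-class queries that carry class-specific geometric and semantic priors. Below is a minimal PyTorch-style sketch of how such class queries might be assembled; the module and parameter names (ClassQueryGenerator, num_classes, embed_dim) and the way the geometric prior is modeled are illustrative assumptions, not the released DISC implementation.

```python
import torch
import torch.nn as nn

class ClassQueryGenerator(nn.Module):
    """Illustrative sketch: build one query per semantic class by combining
    a learned semantic embedding with a class-specific geometric prior.
    Assumption-based reconstruction, not the official DISC code."""

    def __init__(self, num_classes: int = 20, embed_dim: int = 128):
        super().__init__()
        # Learned semantic prior: one embedding vector per class.
        self.semantic_embed = nn.Embedding(num_classes, embed_dim)
        # Geometric prior: here a free learnable vector per class; a real
        # system might derive it from class-wise occupancy or height statistics.
        self.geometric_prior = nn.Parameter(torch.zeros(num_classes, embed_dim))
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, batch_size: int) -> torch.Tensor:
        idx = torch.arange(self.semantic_embed.num_embeddings,
                           device=self.semantic_embed.weight.device)
        sem = self.semantic_embed(idx)                                  # (C, D)
        queries = self.fuse(torch.cat([sem, self.geometric_prior], -1))  # (C, D)
        # One query set per sample: (B, C, D)
        return queries.unsqueeze(0).expand(batch_size, -1, -1)


if __name__ == "__main__":
    gen = ClassQueryGenerator(num_classes=20, embed_dim=128)
    print(gen(batch_size=2).shape)  # torch.Size([2, 20, 128])
```

In a full pipeline these class queries would then attend to image or voxel features through separate instance- and scene-oriented decoder branches, which is the dual-stream separation the abstract refers to.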
Related papers
- Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z)
- BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis [33.53327976669034]
We revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. We introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features.
arXiv Detail & Related papers (2025-03-16T15:13:11Z)
- Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation [9.964615076037397]
Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping. Previous efforts have primarily focused on pixel-level static-dynamic context matching. This paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency framework.
arXiv Detail & Related papers (2024-12-11T02:29:51Z)
- Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation [40.49322398635262]
We propose the first method to tackle 3D open-vocabulary panoptic segmentation.
Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features.
We propose two novel loss functions: object-level distillation loss and voxel-level distillation loss.
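As a rough illustration of the voxel-level distillation idea mentioned in this entry, the sketch below aligns learnable voxel features with frozen CLIP-style features via a cosine-similarity loss; the function name voxel_distill_loss, the tensor shapes, and the masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def voxel_distill_loss(voxel_feats: torch.Tensor,
                       clip_feats: torch.Tensor,
                       valid_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical voxel-level distillation: pull each learnable voxel feature
    toward the frozen CLIP feature associated with the same voxel.
    voxel_feats: (N, D) features from the 3D backbone (trainable).
    clip_feats:  (N, D) frozen CLIP features projected onto the same voxels.
    valid_mask:  (N,) bool mask for voxels that received a CLIP feature."""
    v = F.normalize(voxel_feats[valid_mask], dim=-1)
    c = F.normalize(clip_feats[valid_mask], dim=-1)
    # 1 - cosine similarity, averaged over supervised voxels.
    return (1.0 - (v * c).sum(dim=-1)).mean()

# Example usage with random stand-in tensors:
feats = torch.randn(1000, 512, requires_grad=True)
clip = torch.randn(1000, 512)           # frozen, no gradient
mask = torch.rand(1000) > 0.5
voxel_distill_loss(feats, clip, mask).backward()
```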
arXiv Detail & Related papers (2024-01-04T18:39:32Z)
- SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
- A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors [10.497309421830671]
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner.
This work presents a general and simple framework to tackle point cloud understanding when labels are limited.
arXiv Detail & Related papers (2023-12-03T02:51:54Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation [49.56131393810713]
We present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner.
Our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs.
arXiv Detail & Related papers (2023-06-08T22:55:32Z)
- Segmenting 3D Hybrid Scenes via Zero-Shot Learning [13.161136148641813]
This work tackles the problem of point cloud semantic segmentation for 3D hybrid scenes under the framework of zero-shot learning.
We propose a network, called PFNet, that synthesizes point features for various classes of objects by leveraging the semantic features of both seen and unseen object classes.
The proposed PFNet employs a GAN architecture to synthesize point features, where the semantic relationship between seen-class and unseen-class features is consolidated by adapting a new semantic regularizer (a minimal sketch of this feature-synthesis step follows this entry).
We introduce two benchmarks for algorithmic evaluation by re-organizing the public S3DIS and ScanNet datasets under six different data splits.
arXiv Detail & Related papers (2021-07-01T13:21:49Z)
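The PFNet entry above describes synthesizing point features for unseen classes from semantic embeddings with a GAN. The generator sketch below maps a class's semantic vector plus noise to synthetic point features; all names and dimensions (SemanticFeatureGenerator, sem_dim, noise_dim, feat_dim) are illustrative assumptions rather than the published PFNet code.

```python
import torch
import torch.nn as nn

class SemanticFeatureGenerator(nn.Module):
    """Illustrative GAN generator: synthesize point features for a class from
    its semantic embedding (e.g. a word vector) plus Gaussian noise.
    Names and sizes are assumptions, not the published PFNet implementation."""

    def __init__(self, sem_dim: int = 300, noise_dim: int = 64, feat_dim: int = 256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, feat_dim),
        )

    def forward(self, sem: torch.Tensor, num_points: int) -> torch.Tensor:
        # sem: (sem_dim,) semantic embedding of one seen or unseen class.
        noise = torch.randn(num_points, self.noise_dim, device=sem.device)
        sem_rep = sem.unsqueeze(0).expand(num_points, -1)
        return self.net(torch.cat([sem_rep, noise], dim=-1))  # (num_points, feat_dim)


gen = SemanticFeatureGenerator()
unseen_class_vec = torch.randn(300)          # stand-in for a word embedding
fake_feats = gen(unseen_class_vec, num_points=128)
print(fake_feats.shape)                      # torch.Size([128, 256])
```

In a full zero-shot pipeline, a discriminator and a semantic regularizer of the kind described in the entry would constrain these synthesized features, and a segmentation classifier would then be trained on real seen-class features together with synthesized unseen-class features.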
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.