Towards 3D Object-Centric Feature Learning for Semantic Scene Completion
- URL: http://arxiv.org/abs/2511.13031v2
- Date: Tue, 18 Nov 2025 03:26:17 GMT
- Title: Towards 3D Object-Centric Feature Learning for Semantic Scene Completion
- Authors: Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang
- Abstract summary: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. We propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances. We show that Ocean achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI360 benchmarks, with mIoU scores of 17.40 and 20.28, respectively.
- Score: 18.41627244498394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
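The mIoU figures quoted above are mean intersection-over-union scores over semantic classes. As a reading aid, here is a minimal sketch of how such a score can be computed on flattened voxel labels. It is not the benchmarks' official evaluation code: the function name `mean_iou`, the convention that class 0 means "empty" and is excluded from the mean, the ignore label 255, and the choice to skip classes that never occur are all assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    empty_class: int = 0,     # assumption: class 0 is "empty" and is not averaged
    ignore_label: int = 255,  # assumption: voxels with this target label are skipped
) -> float:
    """Mean IoU over semantic classes for flattened voxel label sequences."""
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class

    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class

    # Average only over semantic classes that occur in pred or target.
    ious = [
        intersection[c] / union[c]
        for c in range(num_classes)
        if c != empty_class and union[c] > 0
    ]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 3 classes (0 = empty) and one ignored voxel.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.2f}")
```

In the toy example, class 1 has IoU 1/2 and class 2 has IoU 2/3, so the mean is 7/12. Papers typically report this value multiplied by 100, which is the scale of the 17.40 and 20.28 figures.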
Related papers
- Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion [4.679314646805623]
3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. We propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level.
arXiv Detail & Related papers (2025-12-07T15:15:52Z)
- MT-Depth: Multi-task Instance feature analysis for the Depth Completion [0.0]
We introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower Root Mean Squared Error (RMSE) compared to both a U-Net-only baseline and previous semantic-guided methods.
arXiv Detail & Related papers (2025-12-04T12:17:33Z)
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation [81.0871900167463]
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation. Given a textureless object, we render normal and point maps from predefined viewpoints. We accept simple 2D prompts - clicks or boxes - to guide part selection. The predicted masks are back-projected to the object and aggregated across views.
arXiv Detail & Related papers (2025-08-19T17:58:51Z)
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z)
- S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving [12.406655155106424]
We propose S3PT, a novel scene semantics and structure guided clustering method that provides more scene-consistent objectives for self-supervised training. Our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level.
arXiv Detail & Related papers (2024-10-30T15:00:06Z)
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- Monocular Per-Object Distance Estimation with Masked Object Modeling [33.59920084936913]
Our paper draws inspiration from Masked Image Modeling (MiM) and extends it to multi-object tasks. Our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. We evaluate the effectiveness of MoM on a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOT Synth datasets.
arXiv Detail & Related papers (2024-01-06T10:56:36Z)
- SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
- Semi-Weakly Supervised Object Kinematic Motion Prediction [56.282759127180306]
Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters.
We propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters.
The network predictions yield a large scale of 3D objects with pseudo labeled mobility information.
arXiv Detail & Related papers (2023-03-31T02:37:36Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Spatial Semantic Embedding Network: Fast 3D Instance Segmentation with Deep Metric Learning [5.699350798684963]
We propose a simple, yet efficient algorithm for 3D instance segmentation using deep metric learning.
For high-level intelligent tasks from a large scale scene, 3D instance segmentation recognizes individual instances of objects.
We demonstrate the state-of-the-art performance of our algorithm in the ScanNet 3D instance segmentation benchmark on AP score.
arXiv Detail & Related papers (2020-07-07T02:17:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.