Related papers: Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

URL: http://arxiv.org/abs/2511.13031v2
Date: Tue, 18 Nov 2025 03:26:17 GMT
Title: Towards 3D Object-Centric Feature Learning for Semantic Scene Completion
Authors: Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang
Abstract summary: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. We propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances. We show that Ocean achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI360 benchmarks, with mIoU scores of 17.40 and 20.28, respectively.
Score: 18.41627244498394
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
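The abstract reports results as mIoU (mean intersection-over-union). As a rough illustration of that metric only, the sketch below computes mIoU over flattened voxel labels. It is not the benchmarks' official evaluation code: the function name `mean_iou`, the `ignore_label` value of 255, and the choice to average only over classes that appear in the prediction or ground truth are all assumptions made for this example. Benchmark scores such as 17.40 are usually given as percentages, so the returned fraction would be multiplied by 100.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_label: int = 255) -> float:
    """Mean IoU over flattened per-voxel class labels (illustrative sketch).

    Assumes every prediction lies in [0, num_classes); voxels whose
    ground-truth label equals ignore_label (hypothetical value) are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled voxel: excluded from the metric
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average only over classes that occur in pred or target (assumption).
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Usage: four voxels, three classes.
score = mean_iou([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)
print(f"mIoU = {100 * score:.2f}")
```

Which classes enter the average (for example, whether the empty class is excluded) differs between benchmarks and is not specified in the abstract, so that detail should be taken from the benchmark's own evaluation protocol.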

Related papers

Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion [4.679314646805623]
3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. We propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level.
arXiv Detail & Related papers (2025-12-07T15:15:52Z)
MT-Depth: Multi-task Instance feature analysis for the Depth Completion [0.0]
We introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower Root Mean Squared Error (RMSE) compared to both a U-Net-only baseline and previous semantic-guided methods.
arXiv Detail & Related papers (2025-12-04T12:17:33Z)
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation [81.0871900167463]
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation. Given a textureless object, we render normal and point maps from predefined viewpoints. We accept simple 2D prompts - clicks or boxes - to guide part selection. The predicted masks are back-projected to the object and aggregated across views.
arXiv Detail & Related papers (2025-08-19T17:58:51Z)
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z)
S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving [12.406655155106424]
We propose S3PT, a novel scene semantics and structure guided clustering approach that provides more scene-consistent objectives for self-supervised training. Our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level.
arXiv Detail & Related papers (2024-10-30T15:00:06Z)
3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding. This paper introduces a novel approach to instance segmentation and tracking in first-person video. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
Monocular Per-Object Distance Estimation with Masked Object Modeling [33.59920084936913]
Our paper draws inspiration from Masked Image Modeling (MiM) and extends it to multi-object tasks. Our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. We evaluate the effectiveness of MoM on a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOT Synth datasets.
arXiv Detail & Related papers (2024-01-06T10:56:36Z)
SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach. Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations. Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
Semi-Weakly Supervised Object Kinematic Motion Prediction [56.282759127180306]
Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters. We propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters. The network predictions yield a large scale of 3D objects with pseudo labeled mobility information.
arXiv Detail & Related papers (2023-03-31T02:37:36Z)
Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
Spatial Semantic Embedding Network: Fast 3D Instance Segmentation with Deep Metric Learning [5.699350798684963]
We propose a simple, yet efficient algorithm for 3D instance segmentation using deep metric learning. For high-level intelligent tasks from a large scale scene, 3D instance segmentation recognizes individual instances of objects. We demonstrate the state-of-the-art performance of our algorithm in the ScanNet 3D instance segmentation benchmark on AP score.
arXiv Detail & Related papers (2020-07-07T02:17:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.