VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
- URL: http://arxiv.org/abs/2506.04623v1
- Date: Thu, 05 Jun 2025 04:31:55 GMT
- Title: VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
- Authors: Wuyang Li, Zhu Yu, Alexandre Alahi
- Abstract summary: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. We propose VoxDet, an instance-centric framework that reformulates voxel-level occupancy prediction as dense object detection.
- Score: 67.09867723723934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and ambiguities between adjacent objects. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides instance-level insight, which has been overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) a Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) a Task-decoupled Dense Predictor that addresses the two sub-tasks via dense detection. Here, we first regress a 4D offset field that estimates the distances (in 6 directions) between each voxel and the object borders in voxel space. The regressed offsets are then used to guide instance-level aggregation in the classification branch, achieving instance-aware prediction. Experiments show that VoxDet can be deployed with both camera and LiDAR inputs, achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.
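The VoxNT trick lends itself to a compact sketch. The following NumPy implementation is only illustrative (the function name, the `free_id` argument, and the exact border definition are our assumptions; the paper may normalize or define borders differently): for every occupied voxel it measures, along each of the six axis directions, how many steps can be taken before the class label changes, and treats that label change as an object border.

```python
import numpy as np

def voxnt_offsets(labels: np.ndarray, free_id: int = 0) -> np.ndarray:
    """Convert an (X, Y, Z) voxel class-label volume into a (6, X, Y, Z)
    offset volume: for every occupied voxel, the number of steps along
    +x, -x, +y, -y, +z, -z until the class label changes, i.e. until an
    (approximate) object border is reached."""

    def run_length(lab: np.ndarray, axis: int, reverse: bool) -> np.ndarray:
        # Distance to the end of the run of identical labels along `axis`.
        lab = np.moveaxis(lab, axis, 0)
        if reverse:
            lab = lab[::-1]
        out = np.zeros(lab.shape, dtype=np.int32)
        for i in range(lab.shape[0] - 2, -1, -1):
            # Extend the run by one step wherever the next label matches.
            out[i] = np.where(lab[i] == lab[i + 1], out[i + 1] + 1, 0)
        if reverse:
            out = out[::-1]
        return np.moveaxis(out, 0, axis)

    offsets = np.stack([run_length(labels, axis, reverse)
                        for axis in range(3)
                        for reverse in (False, True)])
    # Free-space voxels get no offset target.
    return offsets * (labels != free_id)
```

Because only class labels are available, two touching objects of the same class fall into one run; this is the natural approximation obtainable for free from occupancy labels.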
Related papers
- OccLE: Label-Efficient 3D Semantic Occupancy Prediction [48.50138308129873]
OccLE is a label-efficient framework for 3D semantic occupancy prediction. It takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Experiments show that OccLE achieves competitive performance with only 10% of the voxel annotations.
arXiv Detail & Related papers (2025-05-27T01:41:28Z)
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of an object specified by a query sentence from a 3D point cloud.
We propose LESS, a novel Label-Efficient and Single-Stage referring 3D pipeline that is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU while using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding [56.079013202051094]
We present SegVG, a novel method that transfers box-level annotations into segmentation signals, providing additional pixel-level supervision for visual grounding.
This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation.
arXiv Detail & Related papers (2024-07-03T15:30:45Z)
- 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation [20.7179907935644]
3D-AVS is a method for auto-vocabulary segmentation of 3D point clouds, where the vocabulary is unknown and auto-generated for each input at runtime. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. The method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions.
arXiv Detail & Related papers (2024-06-13T13:59:47Z)
- Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation with Single-point Supervision [63.429704654271475]
We propose RWSeg, a novel weakly supervised method that requires labeling only one point per object.
With these sparse weak labels, we introduce a unified framework with two branches to propagate semantic and instance information.
Specifically, we propose a Cross-graph Competing Random Walks (CRW) algorithm that encourages competition among different instance graphs.
arXiv Detail & Related papers (2022-08-10T02:14:39Z)
- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds [16.69887974230884]
Transformers have demonstrated promising performance in many 2D vision tasks.
However, computing self-attention on large-scale point cloud data is cumbersome, because a point cloud is a long sequence that is unevenly distributed in 3D space.
Existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation.
We propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation.
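To see why set-to-set translation sidesteps the quadratic cost, consider a generic induced-set-attention block in which a few learned latent codes mediate between the long point sequence and the output. This is a sketch in the spirit of VoxSeT rather than its actual module (which operates per voxel and injects positional information); all names below are our own.

```python
import torch
import torch.nn as nn

class InducedSetAttention(nn.Module):
    """Two-hop attention through K learned latent codes: cost is O(N*K)
    instead of O(N^2) for N points. A generic stand-in for voxel-based
    set attention, not VoxSeT's exact module."""
    def __init__(self, dim: int, num_latents: int = 8, heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.enc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) point features; N may be large and uneven per scene.
        q = self.latents.unsqueeze(0).expand(x.shape[0], -1, -1)  # (B, K, C)
        h, _ = self.enc(q, x, x)   # latent codes attend to all points
        y, _ = self.dec(x, h, h)   # points read the summary back
        return x + y               # residual connection

# usage: feats = InducedSetAttention(64)(torch.randn(2, 5000, 64))
```

Routing all interaction through K latents is what makes attention tractable on long, unevenly distributed point sequences.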
arXiv Detail & Related papers (2022-03-19T12:31:46Z)
- SASA: Semantics-Augmented Set Abstraction for Point-based 3D Object Detection [78.90102636266276]
We propose Semantics-Augmented Set Abstraction (SASA), a novel set abstraction method.
Based on the estimated point-wise foreground scores, we then propose a semantics-guided point sampling algorithm to help retain more important foreground points during down-sampling.
In practice, SASA proves effective at identifying valuable points related to foreground objects and at improving feature learning for point-based 3D detection.
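As a rough illustration of the idea (not the paper's exact formulation; the `gamma` exponent and the score weighting below are our assumptions), farthest point sampling can be biased toward foreground by re-weighting its selection criterion with the predicted scores:

```python
import torch

def semantics_guided_fps(xyz: torch.Tensor, fg_scores: torch.Tensor,
                         k: int, gamma: float = 1.0) -> torch.Tensor:
    """Farthest point sampling whose selection criterion is re-weighted by
    per-point foreground scores, so likely-foreground points survive
    down-sampling. xyz: (N, 3), fg_scores: (N,) in [0, 1]; returns k indices.
    With gamma = 0 this reduces to plain FPS."""
    n = xyz.shape[0]
    dist = torch.full((n,), float("inf"))
    idx = torch.zeros(k, dtype=torch.long)
    idx[0] = torch.argmax(fg_scores)  # start from the most confident point
    for i in range(1, k):
        # Update each point's distance to the selected set, then pick the
        # point that is both far away and semantically foreground.
        d = ((xyz - xyz[idx[i - 1]]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)
        idx[i] = torch.argmax(dist * fg_scores.pow(gamma))
    return idx
```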
arXiv Detail & Related papers (2022-01-06T08:54:47Z)
- Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective [21.92736190195887]
We revisit Semantic Scene Completion (SSC), a useful task that predicts the semantic and occupancy representation of 3D scenes.
We propose our novel point-voxel aggregation network for this task.
Our model surpasses state-of-the-art methods on two benchmarks by a large margin, with only depth images as input.
arXiv Detail & Related papers (2021-12-24T03:25:40Z)