Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective
- URL: http://arxiv.org/abs/2112.12925v2
- Date: Mon, 20 Mar 2023 12:30:36 GMT
- Title: Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective
- Authors: Xiaokang Chen, Jiaxiang Tang, Jingbo Wang, Gang Zeng
- Abstract summary: We revisit Semantic Scene Completion (SSC), the task of predicting the semantic and occupancy representation of 3D scenes.
We propose a novel point-voxel aggregation network for this task.
Our model surpasses the state of the art on two benchmarks by a large margin, with only depth images as input.
- Score: 21.92736190195887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we revisit Semantic Scene Completion (SSC), the task of predicting the semantic and occupancy representation of 3D scenes. Most methods for this task are built on voxelized scene representations in order to preserve local scene structure. However, because many voxels are visibly empty, these methods incur heavy computational redundancy as the network goes deeper, which limits completion quality. To address this dilemma, we propose a novel point-voxel aggregation network. First, we convert the voxelized scene to a point cloud by removing the visible empty voxels, and adopt a deep point stream to capture semantic information from the scene efficiently. Meanwhile, a lightweight voxel stream containing only two 3D convolution layers preserves the local structure of the voxelized scene. Furthermore, we design an anisotropic voxel aggregation operator to fuse structural details from the voxel stream into the point stream, and a semantic-aware propagation module that uses semantic labels to guide up-sampling in the point stream (a sketch of this two-stream design follows below). We demonstrate that our model surpasses the state of the art on two benchmarks by a large margin, with only depth images as input.
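To make the two-stream design concrete, here is a minimal PyTorch-style sketch of the idea the abstract describes. Everything in it is an illustrative assumption: the module names and shapes are invented, the nearest-voxel feature gather merely stands in for the paper's anisotropic voxel aggregation operator, and the semantic-aware propagation module is omitted for brevity.

```python
# A minimal sketch, assuming hypothetical names/shapes; not the paper's code.
import torch
import torch.nn as nn

def voxels_to_points(tsdf, visible_empty_mask):
    """Keep only voxels that are NOT visibly empty, as an (N, 3) point set."""
    coords = torch.nonzero(~visible_empty_mask)      # (N, 3) integer voxel coords
    feats = tsdf[~visible_empty_mask].unsqueeze(-1)  # (N, 1) per-point feature
    return coords.float(), feats

class PointVoxelSSC(nn.Module):
    """Toy two-stream model: a light voxel stream plus a deeper point stream."""

    def __init__(self, num_classes=12, c=16):
        super().__init__()
        # Light-weight voxel stream: only two 3D convolutions, as in the abstract.
        self.voxel_stream = nn.Sequential(
            nn.Conv3d(1, c, 3, padding=1), nn.ReLU(),
            nn.Conv3d(c, c, 3, padding=1),
        )
        # Deep point stream: shared MLPs (a PointNet-style stand-in).
        self.point_stream = nn.Sequential(
            nn.Linear(1 + c, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, tsdf, visible_empty_mask):
        coords, point_feats = voxels_to_points(tsdf, visible_empty_mask)
        voxel_feats = self.voxel_stream(tsdf[None, None])[0]  # (c, D, H, W)
        # Stand-in for the anisotropic aggregation: gather the voxel feature at
        # each surviving point's cell and concatenate it with the point feature.
        i = coords.long()
        gathered = voxel_feats[:, i[:, 0], i[:, 1], i[:, 2]].t()  # (N, c)
        logits = self.point_stream(torch.cat([point_feats, gathered], dim=-1))
        return coords, logits  # per-point semantic predictions

# Toy usage on a random 32^3 grid where ~70% of voxels are observed empty.
tsdf = torch.randn(32, 32, 32)
visible_empty = torch.rand(32, 32, 32) > 0.3
coords, logits = PointVoxelSSC()(tsdf, visible_empty)
print(coords.shape, logits.shape)  # (N, 3), (N, num_classes)
```

The sketch captures the cost structure the abstract argues for: the deep computation runs only on the points that survive the visible-empty filter, while the dense 3D convolutions stay shallow.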
Related papers
- VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection [67.09867723723934]
3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. We propose VoxDet, an instance-centric framework that reformulates voxel-level occupancy prediction as dense object detection.
arXiv Detail & Related papers (2025-06-05T04:31:55Z)
- VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation [0.0]
Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging.
This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics"-object identity, color, and location-from voxel data.
arXiv Detail & Related papers (2025-03-27T07:07:11Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z)
- VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion [129.5975573092919]
VoxFormer is a Transformer-based semantic scene completion framework.
It can output complete 3D semantics from only 2D images.
Our framework outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics.
arXiv Detail & Related papers (2023-02-23T18:59:36Z)
- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds [16.69887974230884]
Transformers have demonstrated promising performance on many 2D vision tasks.
However, computing self-attention over large-scale point cloud data is cumbersome, because a point cloud is a long sequence that is unevenly distributed in 3D space.
Existing methods usually compute self-attention locally by grouping the points into fixed-size clusters, or perform convolutional self-attention on a discretized representation (a generic sketch of such cluster-based local attention appears after this list).
We propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation.
arXiv Detail & Related papers (2022-03-19T12:31:46Z)
- Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z)
- A Real-Time Online Learning Framework for Joint 3D Reconstruction and Semantic Segmentation of Indoor Scenes [87.74952229507096]
This paper presents a real-time online vision framework that jointly recovers an indoor scene's 3D structure and semantic labels.
Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed neural network learns to fuse the depth over frames with suitable semantic labels in the scene space.
arXiv Detail & Related papers (2021-08-11T14:29:01Z)
- Semantic Scene Completion using Local Deep Implicit Functions on LiDAR Data [4.355440821669468]
We propose a scene segmentation network based on local Deep Implicit Functions as a novel learning-based method for scene completion.
We show that this continuous representation is suitable to encode geometric and semantic properties of extensive outdoor scenes without the need for spatial discretization.
Our experiments verify that our method generates a powerful representation that can be decoded into a dense 3D description of a given scene.
arXiv Detail & Related papers (2020-11-18T07:39:13Z)
- Multi view stereo with semantic priors [3.756550107432323]
We aim to support the standard dense 3D reconstruction of scenes as implemented in the open source library OpenMVS by using semantic priors.
We impose extra semantic constraints in order to remove possible errors and selectively obtain segmented point clouds per label.
arXiv Detail & Related papers (2020-07-05T11:30:29Z)
- 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation.
We devise a new geometry-based strategy to embed depth information in a low-resolution voxel representation.
Our proposed geometric embedding works better than the depth feature learning of conventional SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
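As referenced in the Voxel Set Transformer entry above, here is a generic sketch of the fixed-size-cluster local self-attention that such papers improve upon. This is deliberately not VoxSeT's set-to-set mechanism; all names, shapes, and the padding scheme are assumptions made for illustration only.

```python
# Generic cluster-based local self-attention over points: split the (padded)
# point sequence into fixed-size clusters and attend within each cluster only.
# A hedged illustration of the baseline, not any specific paper's code.
import torch
import torch.nn as nn

class ClusterSelfAttention(nn.Module):
    def __init__(self, dim=64, cluster_size=128, heads=4):
        super().__init__()
        self.cluster_size = cluster_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        """feats: (N, dim) per-point features, N arbitrary."""
        n, dim = feats.shape
        s = self.cluster_size
        pad = (-n) % s                                  # pad N up to a multiple of s
        x = torch.cat([feats, feats.new_zeros(pad, dim)])
        x = x.view(-1, s, dim)                          # (num_clusters, s, dim)
        mask = torch.zeros(x.shape[0], s, dtype=torch.bool)
        if pad:
            mask[-1, -pad:] = True                      # ignore the padded slots
        out, _ = self.attn(x, x, x, key_padding_mask=mask)
        return out.reshape(-1, dim)[:n]                 # back to (N, dim)

feats = torch.randn(1000, 64)                           # 1000 points, 64-d features
print(ClusterSelfAttention()(feats).shape)              # torch.Size([1000, 64])
```

Attention cost drops from O(N^2) to O(N * cluster_size), at the price of no information flow across cluster boundaries, which is precisely the limitation set-to-set approaches such as VoxSeT target.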