SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
- URL: http://arxiv.org/abs/2404.09502v1
- Date: Mon, 15 Apr 2024 06:45:06 GMT
- Title: SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
- Authors: Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma
- Abstract summary: We propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing.
SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline.
It also improves accuracy, from 12.8% to 14.1% mIoU, which can in part be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
- Score: 15.331332063879342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-based perception for autonomous driving requires explicit modeling of a 3D space, into which 2D latent representations are mapped and to which 3D operators are subsequently applied. However, operating on dense latent spaces introduces cubic time and space complexity, which limits scalability in terms of perception range and spatial resolution. Existing approaches compress the dense representation using projections such as Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature pyramid and sparse interpolation enhance each scale with information from the others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which can in part be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
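The headline efficiency gain comes from the spatially decomposed sparse convolution kernels in the 3D sparse diffuser. Below is a minimal sketch of that decomposition idea: a k x k x k kernel is replaced by three 1-D kernels, one per spatial axis, cutting the per-voxel kernel cost from k^3 to 3k. Dense torch.nn.Conv3d is used so the sketch runs without a sparse-convolution dependency; the paper applies the decomposition with sparse convolutions, and all module and parameter names here are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class DecomposedConv3d(nn.Module):
    """Three chained 1-D convolutions standing in for one k^3 kernel."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        p = k // 2
        # One 1-D kernel per spatial axis instead of a full k x k x k kernel.
        self.conv_x = nn.Conv3d(channels, channels, (k, 1, 1), padding=(p, 0, 0))
        self.conv_y = nn.Conv3d(channels, channels, (1, k, 1), padding=(0, p, 0))
        self.conv_z = nn.Conv3d(channels, channels, (1, 1, k), padding=(0, 0, p))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Chaining the 1-D passes grows the receptive field like a full
        # 3-D kernel while touching far fewer weights per voxel.
        return self.conv_z(self.conv_y(self.conv_x(x)))

latent = torch.randn(1, 32, 16, 100, 100)  # (B, C, Z, Y, X) latent volume
print(DecomposedConv3d(32)(latent).shape)  # torch.Size([1, 32, 16, 100, 100])
```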
Related papers
- GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene.
We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians.
GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8%-24.8% of their memory consumption (see the sketch below).
arXiv Detail & Related papers (2024-05-27T17:59:51Z)
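Where SparseOcc keeps a sparse voxel latent, GaussianFormer describes the scene with a flat list of parameterized semantic Gaussians. The sketch below shows one plausible parameter layout and an axis-aligned semantic query; it is an illustrative assumption rather than GaussianFormer's exact formulation, which also learns rotations and opacities and uses an efficient aggregation scheme.

```python
import torch

N, C = 2048, 18                 # number of Gaussians, semantic classes
mean = torch.randn(N, 3)        # Gaussian centers (x, y, z)
scale = torch.rand(N, 3) + 0.1  # per-axis standard deviations
logits = torch.randn(N, C)      # per-Gaussian semantic logits

def query_semantics(points: torch.Tensor) -> torch.Tensor:
    """Aggregate Gaussian contributions at query points: (M, 3) -> (M, C)."""
    d = (points[:, None, :] - mean[None]) / scale[None]  # (M, N, 3)
    w = torch.exp(-0.5 * (d ** 2).sum(-1))               # axis-aligned Gaussian weights
    return w @ logits                                    # weighted semantic mixture

print(query_semantics(torch.randn(4, 3)).shape)  # torch.Size([4, 18])
```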
- NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth Supervision for Indoor Multi-View 3D Detection [72.0098999512727]
NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by utilizing NeRF to enhance representation learning.
We present three corresponding solutions: semantic enhancement, perspective-aware sampling, and ordinal depth supervision.
The resulting algorithm, NeRF-Det++, exhibits appealing performance on the ScanNetV2 and ARKitScenes datasets.
arXiv Detail & Related papers (2024-02-22T11:48:06Z)
- Fully Sparse 3D Occupancy Prediction [37.265473869812816]
Occupancy prediction plays a pivotal role in autonomous driving.
Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering from high computational costs.
We introduce a novel fully sparse occupancy network, termed SparseOcc.
SparseOcc first reconstructs a sparse 3D representation from camera-only inputs and then predicts semantic/instance occupancy from it with sparse queries (see the sketch below).
arXiv Detail & Related papers (2023-12-28T16:54:53Z)
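The sparse-query decoding in this same-named (but distinct) SparseOcc can be pictured as cross-attention between a small set of learned queries and the features of occupied voxels only, never a dense volume. The sketch below is a minimal stand-in under that assumption; the dimensions, single attention layer, and classification head are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

N_vox, C, Q, n_cls = 5000, 128, 100, 18
voxel_feats = torch.randn(1, N_vox, C)        # features of occupied voxels only
queries = nn.Parameter(torch.randn(1, Q, C))  # learned sparse queries

# Queries cross-attend to the sparse voxel features.
attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
decoded, _ = attn(queries, voxel_feats, voxel_feats)

cls_head = nn.Linear(C, n_cls)                # per-query semantic prediction
print(cls_head(decoded).shape)                # torch.Size([1, 100, 18])
```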
- PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction [72.75478398447396]
We propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively.
Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system.
We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane (see the sketch below).
arXiv Detail & Related papers (2023-08-31T17:57:17Z)
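The cylindrical tri-perspective projection can be sketched as mapping each point to (rho, theta, z) grid indices and pooling its features onto three orthogonal planes. In the sketch below, max pooling stands in for the paper's spatial group pooling, and the grid sizes and sensor ranges are illustrative assumptions.

```python
import torch

def cylindrical_tpv(xyz, feat, R=64, T=64, Z=16):
    """Pool point features (N, C) onto three cylindrical TPV planes."""
    rho = xyz[:, :2].norm(dim=1)
    theta = torch.atan2(xyz[:, 1], xyz[:, 0])
    # Quantize each cylindrical coordinate into grid indices (assumed ranges).
    r = (rho / 50.0 * R).long().clamp(0, R - 1)
    t = ((theta + torch.pi) / (2 * torch.pi) * T).long().clamp(0, T - 1)
    z = ((xyz[:, 2] + 3.0) / 8.0 * Z).long().clamp(0, Z - 1)

    C = feat.shape[1]
    planes = [torch.zeros(R, T, C), torch.zeros(T, Z, C), torch.zeros(R, Z, C)]
    for plane, (i, j) in zip(planes, [(r, t), (t, z), (r, z)]):
        flat = plane.view(-1, C)  # view shares storage with the plane
        idx = (i * plane.shape[1] + j)[:, None].expand(-1, C)
        # Max-pool features of all points falling into the same plane cell.
        flat.scatter_reduce_(0, idx, feat, reduce="amax")
    return planes

pts = torch.randn(1000, 3) * 10
feats = torch.rand(1000, 8)  # non-negative, so empty cells stay at zero
for p in cylindrical_tpv(pts, feats):
    print(p.shape)  # (64, 64, 8), (64, 16, 8), (64, 16, 8)
```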
- Ada3D: Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection [19.321076175294902]
Voxel-based methods have achieved state-of-the-art performance for 3D object detection in autonomous driving.
Their significant computational and memory costs pose a challenge for their application to resource-constrained vehicles.
We propose an adaptive inference framework called Ada3D, which focuses on exploiting the input-level spatial redundancy.
arXiv Detail & Related papers (2023-07-17T02:58:51Z)
- BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
- Exploiting More Information in Sparse Point Cloud for 3D Single Object Tracking [9.693724357115762]
3D single object tracking is a key task in 3D computer vision.
The sparsity of point clouds makes it difficult to compute the similarity and locate the object.
We propose a sparse-to-dense and transformer-based framework for 3D single object tracking.
arXiv Detail & Related papers (2022-10-02T13:38:30Z)
- Spatial Pruned Sparse Convolution for Efficient 3D Object Detection [41.62839541489369]
3D scenes are dominated by a large number of background points, which is redundant for the detection task that mainly needs to focus on foreground objects.
In this paper, we analyze the major components of existing 3D CNNs and find that they ignore the redundancy of the data and further amplify it during down-sampling, which incurs a large amount of unnecessary computational overhead.
We propose a new convolution operator named spatial pruned sparse convolution (SPS-Conv), which includes two variants: spatial pruned submanifold sparse convolution (SPSS-Conv) and spatial pruned regular sparse convolution (SPRS-Conv) (see the sketch below).
arXiv Detail & Related papers (2022-09-28T16:19:06Z)
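The spatial pruning idea can be sketched as ranking sparse voxels by feature magnitude and letting only the most active ones take part in the convolution. The keep ratio and the plain top-k criterion below are illustrative assumptions; in the paper, the two variants differ in whether pruned positions are skipped entirely or merely kept from dilating.

```python
import torch

def prune_voxels(coords: torch.Tensor, feats: torch.Tensor, keep_ratio: float = 0.5):
    """coords: (N, 3) integer voxel indices, feats: (N, C) voxel features."""
    importance = feats.abs().mean(dim=1)  # magnitude-based importance score
    k = max(1, int(keep_ratio * feats.shape[0]))
    keep = importance.topk(k).indices     # indices of the most active voxels
    return coords[keep], feats[keep]

coords = torch.randint(0, 128, (10000, 3))
feats = torch.randn(10000, 16)
c2, f2 = prune_voxels(coords, feats, keep_ratio=0.3)
print(c2.shape, f2.shape)  # torch.Size([3000, 3]) torch.Size([3000, 16])
```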
- From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection [101.20784125067559]
We propose a new architecture, namely Hallucinated Hollow-3D R-CNN, to address the problem of 3D object detection.
In our approach, we first extract multi-view features by sequentially projecting the point clouds into the perspective view and the bird's-eye view.
The 3D objects are detected via a box refinement module with a novel Hierarchical Voxel RoI Pooling operation.
arXiv Detail & Related papers (2021-07-30T02:00:06Z)
- Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net [93.51773847125014]
We propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor.
Our approach performs 3D convolutions across space and time over a bird's-eye-view representation of the 3D world (see the sketch below).
arXiv Detail & Related papers (2020-12-22T22:43:35Z)
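The joint space-time reasoning can be sketched as stacking several past BEV frames into a single tensor and convolving over time and space at once, so detection, tracking, and forecasting share one temporal feature volume. Channel sizes and the one-layer head below are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

T, H, W = 5, 200, 200                  # past frames, BEV grid size
bev_seq = torch.randn(1, 32, T, H, W)  # (B, C, T, H, W) stacked BEV frames

# Convolve jointly over the temporal and spatial axes, collapsing time.
spatiotemporal = nn.Conv3d(32, 64, kernel_size=(T, 3, 3), padding=(0, 1, 1))
fused = spatiotemporal(bev_seq).squeeze(2)  # -> (B, 64, H, W)

head = nn.Conv2d(64, 7, kernel_size=1)      # e.g. box parameters per BEV cell
print(head(fused).shape)                    # torch.Size([1, 7, 200, 200])
```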
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.