COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
- URL: http://arxiv.org/abs/2312.01919v2
- Date: Thu, 11 Apr 2024 10:38:33 GMT
- Title: COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
- Authors: Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, Yuan Xie
- Abstract summary: The autonomous driving community has shown significant interest in 3D occupancy prediction.
We propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder.
COTR outperforms baselines with a relative improvement of 8%-15%.
- Score: 60.87168562615171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The autonomous driving community has shown significant interest in 3D occupancy prediction, driven by its exceptional geometric perception and general object recognition capabilities. To achieve this, current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from Bird's-Eye-View perception. However, compressed views like the TPV representation lose 3D geometry information, while the raw, sparse OCC representation incurs heavy yet redundant computational costs. To address these limitations, we propose the Compact Occupancy TRansformer (COTR), with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. The occupancy decoder then enhances the semantic discriminability of the compact OCC representation via a coarse-to-fine semantic grouping strategy. Experiments show consistent performance gains across multiple baselines: COTR outperforms them by a relative 8%-15%, demonstrating the superiority of our method.
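The two stages described in the abstract can be illustrated with a toy sketch. This is a minimal illustrative assumption, not the paper's actual implementation: `compact_occupancy_encode` stands in for the encoder's compression of a dense voxel grid into a compact OCC feature, and `coarse_to_fine_group_decode` stands in for the grouped decoding, first picking a coarse semantic group, then a fine class within it. All function names, shapes, and the pooling/grouping choices here are hypothetical.

```python
import numpy as np

def compact_occupancy_encode(dense_voxels, stride=2):
    """Compress a dense occupancy grid (X, Y, Z, C) by average-pooling,
    a crude stand-in for producing a 'compact OCC representation'."""
    X, Y, Z, C = dense_voxels.shape
    xs, ys, zs = X // stride, Y // stride, Z // stride
    v = dense_voxels[:xs * stride, :ys * stride, :zs * stride]
    v = v.reshape(xs, stride, ys, stride, zs, stride, C)
    return v.mean(axis=(1, 3, 5))  # shape (X/s, Y/s, Z/s, C)

def coarse_to_fine_group_decode(compact_feat, groups):
    """groups: dict mapping a coarse group name to the list of fine
    class-channel indices it contains. Each voxel is first assigned the
    coarse group whose best member score is highest, then the winning
    fine class inside that group."""
    labels = np.zeros(compact_feat.shape[:3], dtype=np.int64)
    for idx in np.ndindex(*compact_feat.shape[:3]):
        scores = compact_feat[idx]
        best_group = max(groups, key=lambda g: scores[groups[g]].max())
        members = groups[best_group]
        labels[idx] = members[int(np.argmax(scores[members]))]
    return labels
```

The point of the grouping step is that a voxel never competes across all classes at once; it competes within the coarse group it was routed to, which is the coarse-to-fine idea in miniature.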
Related papers
- 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation [16.69186493462387]
We introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context.
In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions.
We show that ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.
arXiv Detail & Related papers (2025-03-19T13:14:57Z) - CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer [42.68740105997167]
We introduce two frameworks for 3D object detection with minimal hand-crafted design.
Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal.
Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information.
arXiv Detail & Related papers (2024-06-12T12:40:28Z) - GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z) - CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting [68.94594215660473]
We propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS)
We exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms.
Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality.
arXiv Detail & Related papers (2024-04-15T04:50:39Z) - Visualizing High-Dimensional Configuration Spaces: A Comprehensive Analytical Approach [0.4143603294943439]
We present a novel approach for visualizing the high-dimensional configuration spaces (C-spaces) of manipulator robots in a 2D format.
We provide a new tool for qualitative evaluation of high-dimensional C-space approximations without reducing the original dimension.
arXiv Detail & Related papers (2023-12-18T04:05:48Z) - PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction [72.75478398447396]
We propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively.
Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system.
We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane.
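The cylindrical TPV construction described above can be sketched as follows. This is a hedged toy example, not PointOcc's implementation: points are converted to cylindrical coordinates (rho, phi, z) and histogrammed onto the three TPV planes (rho-phi, rho-z, phi-z) as a crude stand-in for the paper's spatial group pooling. The bin counts and range limits are illustrative assumptions.

```python
import numpy as np

def cylindrical_tpv(points, bins=(16, 36, 8), r_max=50.0, z_range=(-3.0, 5.0)):
    """Project an (N, 3) point cloud onto three cylindrical TPV planes
    via point-count histograms (a toy substitute for spatial group pooling)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)
    phi = np.arctan2(y, x)  # in [-pi, pi]
    # Discretize each cylindrical coordinate into its bin index.
    r_idx = np.clip((rho / r_max * bins[0]).astype(int), 0, bins[0] - 1)
    p_idx = np.clip(((phi + np.pi) / (2 * np.pi) * bins[1]).astype(int), 0, bins[1] - 1)
    z_idx = np.clip(((z - z_range[0]) / (z_range[1] - z_range[0]) * bins[2]).astype(int),
                    0, bins[2] - 1)
    planes = {
        "rho_phi": np.zeros((bins[0], bins[1])),
        "rho_z":   np.zeros((bins[0], bins[2])),
        "phi_z":   np.zeros((bins[1], bins[2])),
    }
    # Unbuffered accumulation so repeated indices all count.
    np.add.at(planes["rho_phi"], (r_idx, p_idx), 1)
    np.add.at(planes["rho_z"], (r_idx, z_idx), 1)
    np.add.at(planes["phi_z"], (p_idx, z_idx), 1)
    return planes
```

The cylindrical grid is a natural fit for LiDAR because point density falls off with radial distance, so radial bins keep roughly comparable point counts per cell.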
arXiv Detail & Related papers (2023-08-31T17:57:17Z) - Scene as Occupancy [66.43673774733307]
OccNet is a vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy.
We propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes.
arXiv Detail & Related papers (2023-04-11T16:15:50Z) - OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction [16.66987810790077]
OccFormer is a dual-path transformer network to process the 3D volume for semantic occupancy prediction.
It achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features.
arXiv Detail & Related papers (2023-03-08T23:21:19Z) - UT-Net: Combining U-Net and Transformer for Joint Optic Disc and Cup Segmentation and Glaucoma Detection [0.0]
Glaucoma is a chronic eye disease that can cause permanent, irreversible blindness.
Measuring the cup-to-disc ratio (CDR) plays a pivotal role in detecting glaucoma at an early stage, before vision loss occurs.
We propose a new segmentation pipeline, called UT-Net, that combines the advantages of both U-Net and a transformer in its encoding layers, followed by an attention-gated bilinear fusion scheme.
arXiv Detail & Related papers (2023-03-08T23:21:19Z) - Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves the AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.