Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement
- URL: http://arxiv.org/abs/2407.13155v2
- Date: Sun, 21 Jul 2024 07:28:19 GMT
- Title: Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement
- Authors: Yulin He, Wei Chen, Tianci Xun, Yusong Tan,
- Abstract summary: Occupancy prediction plays a pivotal role in autonomous driving (AD)
Existing methods often incur high computational costs, which contradicts the real-time demands of AD.
We propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation.
- Score: 8.592248643229675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Occupancy prediction plays a pivotal role in autonomous driving (AD) due to the fine-grained geometric perception and general object recognition capabilities. However, existing methods often incur high computational costs, which contradicts the real-time demands of AD. To this end, we first evaluate the speed and memory usage of most public available methods, aiming to redirect the focus from solely prioritizing accuracy to also considering efficiency. We then identify a core challenge in achieving both fast and accurate performance: \textbf{the strong coupling between geometry and semantic}. To address this issue, 1) we propose a Geometric-Semantic Dual-Branch Network (GSDBN) with a hybrid BEV-Voxel representation. In the BEV branch, a BEV-level temporal fusion module and a U-Net encoder is introduced to extract dense semantic features. In the voxel branch, a large-kernel re-parameterized 3D convolution is proposed to refine sparse 3D geometry and reduce computation. Moreover, we propose a novel BEV-Voxel lifting module that projects BEV features into voxel space for feature fusion of the two branches. In addition to the network design, 2) we also propose a Geometric-Semantic Decoupled Learning (GSDL) strategy. This strategy initially learns semantics with accurate geometry using ground-truth depth, and then gradually mixes predicted depth to adapt the model to the predicted geometry. Extensive experiments on the widely-used Occ3D-nuScenes benchmark demonstrate the superiority of our method, which achieves a 39.4 mIoU with 20.0 FPS. This result is $\sim 3 \times$ faster and +1.9 mIoU higher compared to FB-OCC, the winner of CVPR2023 3D Occupancy Prediction Challenge. Our code will be made open-source.
Related papers
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction [10.394184895110007]
We present GPOcc, a framework that leverages visual geometry priors for monocular occupancy prediction.<n>Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains.
arXiv Detail & Related papers (2026-02-25T04:16:54Z) - HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation [39.58684038370709]
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots.<n>Previous state-of-the-art methods often face a trade-off between accuracy and speed.<n>We introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network.
arXiv Detail & Related papers (2025-10-08T10:46:07Z) - GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation [57.8059956428009]
Recent attempts to transfer features from 2D Vision-Language Models to 3D semantic segmentation expose a persistent trade-off.<n>We propose GeoPurify that applies a small Student Affinity Network to 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model.<n>Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency.
arXiv Detail & Related papers (2025-10-02T16:37:56Z) - Unleashing Semantic and Geometric Priors for 3D Scene Completion [18.515824341739]
Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation.<n>Existing methods rely on a coupled encoder to deliver both semantic and geometric priors.<n>We propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels.
arXiv Detail & Related papers (2025-08-19T08:10:39Z) - Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition [4.196626042312499]
We propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning.<n>Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density.<n>With the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird's-eye view and 3D segment perspectives.
arXiv Detail & Related papers (2025-06-17T07:04:07Z) - E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs)<n>GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters.<n>We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains.<n>All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z) - EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler [43.277357306520145]
This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations.
A Geometry-guided Refinement Module (GRM) constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS)
Experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc achieves state-of-the-art performance across different settings.
arXiv Detail & Related papers (2025-04-13T12:10:49Z) - Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation [32.50849425431012]
For autonomous cars equipped with multi-camera and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions.
Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space.
In this work, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence.
arXiv Detail & Related papers (2024-11-19T02:40:42Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes [53.107474952492396]
CityGaussianV2 is a novel approach for large-scale scene reconstruction.
We implement a decomposed-gradient-based densification and depth regression technique to eliminate blurry artifacts and accelerate convergence.
Our method strikes a promising balance between visual quality, geometric accuracy, as well as storage and training costs.
arXiv Detail & Related papers (2024-11-01T17:59:31Z) - OccRWKV: Rethinking Efficient 3D Semantic Occupancy Prediction with Linear Complexity [11.287721740276048]
3D semantic occupancy prediction networks have demonstrated remarkable capabilities in reconstructing the geometric and semantic structure of 3D scenes.
We introduce OccRWKV, an efficient semantic occupancy network inspired by Receptance Weighted Key Value (RWKV)
OccRWKV separates semantics, occupancy prediction, and feature fusion into distinct branches, each incorporating Sem-RWKV and Geo-RWKV blocks.
arXiv Detail & Related papers (2024-09-30T06:27:50Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection [36.245654685143016]
Bird's-Eye-View (BEV) representation has emerged as a mainstream paradigm for multi-view 3D object detection.
Existing methods overlook the geometric quality of BEV representation, leaving it in a low-resolution state.
We propose Radial-Cartesian BEV Sampling (RC-Sampling) to generate high-resolution dense BEV representation.
In conjunction with the In-Box Label, Centroid-Aware Inner Loss (CAI Loss) is developed to capture the inner geometric structure of objects.
arXiv Detail & Related papers (2024-09-03T11:57:36Z) - IGEV++: Iterative Multi-range Geometry Encoding Volumes for Stereo Matching [7.859381791267791]
We propose a new deep network architecture, called IGEV++, for stereo matching.
The proposed IGEV++ builds Multi-range Geometry Volumes (MGEV) that encode coarse-grained geometry information for ill-posed regions.
We introduce an adaptive patch matching module that efficiently computes matching costs for large disparity ranges and/or ill-posed regions.
arXiv Detail & Related papers (2024-09-01T07:02:36Z) - BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z) - Fully Sparse Fusion for 3D Object Detection [69.32694845027927]
Currently prevalent multimodal 3D detection methods are built upon LiDAR-based detectors that usually use dense Bird's-Eye-View feature maps.
Fully sparse architecture is gaining attention as they are highly efficient in long-range perception.
In this paper, we study how to effectively leverage image modality in the emerging fully sparse architecture.
arXiv Detail & Related papers (2023-04-24T17:57:43Z) - BEVDetNet: Bird's Eye View LiDAR Point Cloud based Real-time 3D Object
Detection for Autonomous Driving [6.389322215324224]
We propose a novel semantic segmentation architecture as a single unified model for object center detection using key points, box predictions and orientation prediction.
The proposed architecture can be trivially extended to include semantic segmentation classes like road without any additional computation.
The model is 5X faster than other top accuracy models with a minimal accuracy degradation of 2% in Average Precision at IoU=0.5 on KITTI dataset.
arXiv Detail & Related papers (2021-04-21T22:06:39Z) - Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z) - Monocular 3D Detection with Geometric Constraints Embedding and
Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D objects detection using only RGB images, called KM3D-Net.
We design a fully convolutional model to predict object keypoints, dimension, and orientation, and then combine these estimations with perspective geometry constraints to compute position attribute.
arXiv Detail & Related papers (2020-09-02T00:51:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.