HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction
- URL: http://arxiv.org/abs/2408.09104v1
- Date: Sat, 17 Aug 2024 05:50:51 GMT
- Title: HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction
- Authors: Xiao Zhao, Bo Chen, Mingyang Sun, Dingkang Yang, Youxing Wang, Xukun Zhang, Mingcheng Li, Dongliang Kou, Xiaoyi Wei, Lihua Zhang,
- Abstract summary: Vision-based 3D semantic scene completion describes autonomous driving scenes through 3D volume representations.
HybridOcc is a hybrid 3D volume query proposal method generated by Transformer framework and NeRF representation.
We present an innovative occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface.
- Score: 14.000919964212857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-based 3D semantic scene completion (SSC) describes autonomous driving scenes through 3D volume representations. However, the occlusion of invisible voxels by scene surfaces poses challenges to current SSC methods in hallucinating refined 3D geometry. This paper proposes HybridOcc, a hybrid 3D volume query proposal method generated by Transformer framework and NeRF representation and refined in a coarse-to-fine SSC prediction framework. HybridOcc aggregates contextual features through the Transformer paradigm based on hybrid query proposals while combining it with NeRF representation to obtain depth supervision. The Transformer branch contains multiple scales and uses spatial cross-attention for 2D to 3D transformation. The newly designed NeRF branch implicitly infers scene occupancy through volume rendering, including visible and invisible voxels, and explicitly captures scene depth rather than generating RGB color. Furthermore, we present an innovative occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface, further improving the overall performance. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our HybridOcc on the SSC task.
Related papers
- GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation [75.39457097832113]
This paper introduces a novel 3D generation framework, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space.
Our framework employs a Variational Autoencoder with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information.
The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs.
arXiv Detail & Related papers (2024-11-12T18:59:32Z) - HorGait: A Hybrid Model for Accurate Gait Recognition in LiDAR Point Cloud Planar Projections [8.56443762544299]
HorGait is a hybrid model with a Transformer architecture for gait recognition on the planar projection of 3D point clouds from LiDAR.
It achieves state-of-the-art performance among Transformer architecture methods on the SUSTech1K dataset.
arXiv Detail & Related papers (2024-10-11T02:12:41Z) - Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction [11.840097269724792]
3D Gaussian Splatting (3DGS) has emerged as a promising approach for 3D scene representation, offering a reduction in computational overhead compared to Neural Radiance Fields (NeRF)
We introduce SVS-GS, a novel framework for Sparse Viewpoint Scene reconstruction that integrates a 3D Gaussian smoothing filter to suppress artifacts.
arXiv Detail & Related papers (2024-09-05T03:18:04Z) - GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves State-Of-The-Art performance on the Occ3D-nuScenes dataset with the least image resolution needed and the most weightless image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z) - HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes [24.227745405760697]
We propose a hybrid optimization method named HO-Gaussian, which combines a grid-based volume with the 3DGS pipeline.
Results on widely used autonomous driving datasets demonstrate that HO-Gaussian achieves photo-realistic rendering in real-time on multi-camera urban datasets.
arXiv Detail & Related papers (2024-03-29T07:58:21Z) - CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs [65.80187860906115]
We propose a novel approach to improve NeRF's performance with sparse inputs.
We first adopt a voxel-based ray sampling strategy to ensure that the sampled rays intersect with a certain voxel in 3D space.
We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray, which are then incorporated into the volume rendering.
arXiv Detail & Related papers (2024-03-25T15:56:17Z) - StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances.
First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss.
Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
arXiv Detail & Related papers (2023-12-02T02:27:58Z) - NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized
Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z) - OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy
Prediction [16.66987810790077]
OccFormer is a dual-path transformer network to process the 3D volume for semantic occupancy prediction.
It achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features.
arXiv Detail & Related papers (2023-04-11T16:15:50Z) - Neural Volume Super-Resolution [49.879789224455436]
We propose a neural super-resolution network that operates directly on the volumetric representation of the scene.
To realize our method, we devise a novel 3D representation that hinges on multiple 2D feature planes.
We validate the proposed method by super-resolving multi-view consistent views on a diverse set of unseen 3D scenes.
arXiv Detail & Related papers (2022-12-09T04:54:13Z) - TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [26.87200488085741]
TransformerFusion is a transformer-based 3D scene reconstruction approach.
Network learns to attend to the most relevant image frames for each 3D location in the scene.
Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed.
arXiv Detail & Related papers (2021-07-05T18:00:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.