Related papers: HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

URL: http://arxiv.org/abs/2408.09104v1
Date: Sat, 17 Aug 2024 05:50:51 GMT
Title: HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction
Authors: Xiao Zhao, Bo Chen, Mingyang Sun, Dingkang Yang, Youxing Wang, Xukun Zhang, Mingcheng Li, Dongliang Kou, Xiaoyi Wei, Lihua Zhang,
Abstract summary: Vision-based 3D semantic scene completion describes autonomous driving scenes through 3D volume representations. HybridOcc is a hybrid 3D volume query proposal method generated by Transformer framework and NeRF representation. We present an innovative occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface.
Score: 14.000919964212857
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-based 3D semantic scene completion (SSC) describes autonomous driving scenes through 3D volume representations. However, the occlusion of invisible voxels by scene surfaces poses challenges to current SSC methods in hallucinating refined 3D geometry. This paper proposes HybridOcc, a hybrid 3D volume query proposal method generated by Transformer framework and NeRF representation and refined in a coarse-to-fine SSC prediction framework. HybridOcc aggregates contextual features through the Transformer paradigm based on hybrid query proposals while combining it with NeRF representation to obtain depth supervision. The Transformer branch contains multiple scales and uses spatial cross-attention for 2D to 3D transformation. The newly designed NeRF branch implicitly infers scene occupancy through volume rendering, including visible and invisible voxels, and explicitly captures scene depth rather than generating RGB color. Furthermore, we present an innovative occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface, further improving the overall performance. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our HybridOcc on the SSC task.

Related papers

Robust Mesh Saliency GT Acquisition in VR via View Cone Sampling and Geometric Smoothing [59.12032628787018]
3D mesh saliency ground truth is essential for human-centric visual modeling in virtual reality (VR)<n>Current VR eye-tracking pipelines rely on single ray sampling and Euclidean smoothing, triggering texture attention and signal leakage across gaps.<n>This paper proposes a robust framework to address these limitations.
arXiv Detail & Related papers (2026-01-06T05:20:12Z)
EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis [61.1662426227688]
Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner.
arXiv Detail & Related papers (2025-03-26T02:47:27Z)
BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation [54.12899218104669]
3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures.<n>Current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators.<n>We propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation.
arXiv Detail & Related papers (2025-01-15T11:33:34Z)
Semantic Scene Completion with Multi-Feature Data Balancing Network [5.3431413737671525]
We propose a dual-head model for RGB and depth data (F-TSDF) inputs. Our hybrid encoder-decoder architecture with identity transformation in a pre-activation residual module effectively manages diverse signals within F-TSDF. We evaluate RGB feature fusion strategies and use a combined loss function cross entropy for 2D RGB features and weighted cross-entropy for 3D SSC predictions.
arXiv Detail & Related papers (2024-12-02T12:12:21Z)
T-3DGS: Removing Transient Objects for 3D Scene Reconstruction [83.05271859398779]
Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. We propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting.
arXiv Detail & Related papers (2024-11-29T07:45:24Z)
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation [75.39457097832113]
This paper introduces a novel 3D generation framework, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs.
arXiv Detail & Related papers (2024-11-12T18:59:32Z)
HorGait: A Hybrid Model for Accurate Gait Recognition in LiDAR Point Cloud Planar Projections [8.56443762544299]
HorGait is a hybrid model with a Transformer architecture for gait recognition on the planar projection of 3D point clouds from LiDAR. It achieves state-of-the-art performance among Transformer architecture methods on the SUSTech1K dataset.
arXiv Detail & Related papers (2024-10-11T02:12:41Z)
Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction [11.840097269724792]
3D Gaussian Splatting (3DGS) has emerged as a promising approach for 3D scene representation, offering a reduction in computational overhead compared to Neural Radiance Fields (NeRF) We introduce SVS-GS, a novel framework for Sparse Viewpoint Scene reconstruction that integrates a 3D Gaussian smoothing filter to suppress artifacts.
arXiv Detail & Related papers (2024-09-05T03:18:04Z)
GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach achieves State-Of-The-Art performance on the Occ3D-nuScenes dataset with the least image resolution needed and the most weightless image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes [24.227745405760697]
We propose a hybrid optimization method named HO-Gaussian, which combines a grid-based volume with the 3DGS pipeline. Results on widely used autonomous driving datasets demonstrate that HO-Gaussian achieves photo-realistic rendering in real-time on multi-camera urban datasets.
arXiv Detail & Related papers (2024-03-29T07:58:21Z)
CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs [65.80187860906115]
We propose a novel approach to improve NeRF's performance with sparse inputs. We first adopt a voxel-based ray sampling strategy to ensure that the sampled rays intersect with a certain voxel in 3D space. We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray, which are then incorporated into the volume rendering.
arXiv Detail & Related papers (2024-03-25T15:56:17Z)
Hybrid Explicit Representation for Ultra-Realistic Head Avatars [55.829497543262214]
We introduce a novel approach to creating ultra-realistic head avatars and rendering them in real-time. UV-mapped 3D mesh is utilized to capture sharp and rich textures on smooth surfaces, while 3D Gaussian Splatting is employed to represent complex geometric structures. Experiments that our modeled results exceed those of state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T04:01:26Z)
StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances. First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
arXiv Detail & Related papers (2023-12-02T02:27:58Z)
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels. We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction [16.66987810790077]
OccFormer is a dual-path transformer network to process the 3D volume for semantic occupancy prediction. It achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features.
arXiv Detail & Related papers (2023-04-11T16:15:50Z)
Neural Volume Super-Resolution [49.879789224455436]
We propose a neural super-resolution network that operates directly on the volumetric representation of the scene. To realize our method, we devise a novel 3D representation that hinges on multiple 2D feature planes. We validate the proposed method by super-resolving multi-view consistent views on a diverse set of unseen 3D scenes.
arXiv Detail & Related papers (2022-12-09T04:54:13Z)
TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [26.87200488085741]
TransformerFusion is a transformer-based 3D scene reconstruction approach. Network learns to attend to the most relevant image frames for each 3D location in the scene. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed.
arXiv Detail & Related papers (2021-07-05T18:00:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.