InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction
- URL: http://arxiv.org/abs/2401.12422v2
- Date: Mon, 29 Apr 2024 07:14:45 GMT
- Title: InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction
- Authors: Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall,
- Abstract summary: InverseMatrixVT3D is an efficient method for transforming multi-view image features into 3D feature volumes for semantic occupancy prediction.
We introduce a sparse matrix handling technique for the projection matrices to optimize GPU memory usage.
Our approach achieves the top performance in detecting vulnerable road users (VRU), crucial for autonomous driving and road safety.
- Score: 11.33083039877258
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach leverages two projection matrices to store the static mapping relationships and matrix multiplications to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes. Specifically, we achieve this by performing matrix multiplications between multi-view image feature maps and two sparse projection matrices. We introduce a sparse matrix handling technique for the projection matrices to optimize GPU memory usage. Moreover, a global-local attention fusion module is proposed to integrate the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to enhance performance further. Extensive experiments performed on the nuScenes and SemanticKITTI datasets reveal that our approach not only stands out for its simplicity and effectiveness but also achieves the top performance in detecting vulnerable road users (VRU), crucial for autonomous driving and road safety. The code has been made available at: https://github.com/DanielMing123/InverseMatrixVT3D
Related papers
- LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z) - Regulating Intermediate 3D Features for Vision-Centric Autonomous
Driving [26.03800936700545]
We propose to regulate intermediate dense 3D features with the help of volume rendering.
Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features.
arXiv Detail & Related papers (2023-12-19T04:09:05Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy
Prediction [16.66987810790077]
OccFormer is a dual-path transformer network to process the 3D volume for semantic occupancy prediction.
It achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features.
arXiv Detail & Related papers (2023-04-11T16:15:50Z) - HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object
Detection [39.64891219500416]
3D object detection methods exploit either voxel-based or point-based features to represent 3D objects in a scene.
We introduce in this paper a novel single-stage 3D detection method having the merit of both voxel-based and point-based features.
arXiv Detail & Related papers (2021-04-02T06:34:49Z) - Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic
Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a 3D cylinder partition and a 3D cylinder convolution based framework, termed as Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds.
arXiv Detail & Related papers (2020-08-04T13:56:19Z) - Virtual Multi-view Fusion for 3D Semantic Segmentation [11.259694096475766]
We show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches.
When the 2D per pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method is able to achieve significantly better 3D semantic segmentation results.
arXiv Detail & Related papers (2020-07-26T14:46:55Z) - PerMO: Perceiving More at Once from a Single Image for Autonomous
Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.