Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
- URL: http://arxiv.org/abs/2302.07817v1
- Date: Wed, 15 Feb 2023 17:58:10 GMT
- Title: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
- Authors: Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu
- Abstract summary: We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes.
We model each point in the 3D space by summing its projected features on the three planes.
Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
- Score: 84.94140661523956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern methods for vision-centric autonomous driving perception widely adopt
the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its
better efficiency than voxel representation, it has difficulty describing the
fine-grained 3D structure of a scene with a single plane. To address this, we
propose a tri-perspective view (TPV) representation which accompanies BEV with
two additional perpendicular planes. We model each point in the 3D space by
summing its projected features on the three planes. To lift image features to
the 3D TPV space, we further propose a transformer-based TPV encoder
(TPVFormer) to obtain the TPV features effectively. We employ the attention
mechanism to aggregate the image features corresponding to each query in each
TPV plane. Experiments show that our model trained with sparse supervision
effectively predicts the semantic occupancy for all voxels. We demonstrate for
the first time that using only camera inputs can achieve comparable performance
with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code:
https://github.com/wzzheng/TPVFormer.
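To make the core idea concrete, below is a minimal sketch (not the authors' released code) of how a 3D point's feature could be assembled by bilinearly sampling three orthogonal TPV planes and summing the results. The plane naming, shapes, and helper functions are illustrative assumptions, not the paper's exact implementation.
```python
import torch
import torch.nn.functional as F

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at normalized coords u, v in [0, 1]."""
    grid = torch.stack([u, v], dim=-1) * 2.0 - 1.0   # grid_sample expects [-1, 1]
    grid = grid.view(1, 1, -1, 2)                    # (1, 1, P, 2)
    feats = F.grid_sample(plane.unsqueeze(0), grid,
                          mode="bilinear", align_corners=False)  # (1, C, 1, P)
    return feats.squeeze(0).squeeze(1).t()           # (P, C)

def tpv_point_features(points, tpv_hw, tpv_dh, tpv_wd, scene_range):
    """points: (P, 3) xyz; tpv_*: three (C, ., .) planes; scene_range: (3, 2) min/max per axis."""
    lo, hi = scene_range[:, 0], scene_range[:, 1]
    x, y, z = ((points - lo) / (hi - lo)).unbind(dim=-1)   # normalize xyz to [0, 1]
    # Project each point onto the three planes and sum the sampled features.
    f_hw = sample_plane(tpv_hw, x, y)   # top (BEV-like) plane
    f_dh = sample_plane(tpv_dh, y, z)   # side plane
    f_wd = sample_plane(tpv_wd, z, x)   # front plane
    return f_hw + f_dh + f_wd           # (P, C)

# Example query with random planes, points, and assumed scene bounds.
planes = [torch.randn(64, 100, 100) for _ in range(3)]
pts = torch.rand(1000, 3) * 50.0
rng = torch.tensor([[0.0, 50.0], [0.0, 50.0], [0.0, 50.0]])
feat = tpv_point_features(pts, *planes, rng)   # -> (1000, 64)
```
Summing (rather than concatenating) keeps the per-point cost fixed regardless of the number of planes, which fits the abstract's efficiency argument against a full voxel grid; the per-query attention lifting in TPVFormer is not reproduced in this sketch.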
Related papers
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction [72.75478398447396]
We propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively.
Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system.
We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane.
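As a rough illustration of the cylindrical construction (the bounds, resolutions, and binning scheme below are placeholders, not values from the paper):
```python
import numpy as np

def to_cylindrical(xyz):
    """xyz: (N, 3) Cartesian LiDAR points -> (N, 3) (rho, theta, z)."""
    rho = np.hypot(xyz[:, 0], xyz[:, 1])        # radial distance from the sensor
    theta = np.arctan2(xyz[:, 1], xyz[:, 0])    # azimuth in [-pi, pi]
    return np.stack([rho, theta, xyz[:, 2]], axis=-1)

def tpv_plane_indices(cyl, lo, hi, resolution=(128, 128, 32)):
    """Discretize (rho, theta, z) and return per-point indices into the three
    cylindrical TPV planes: (rho, theta), (theta, z), and (z, rho)."""
    res = np.array(resolution)
    norm = (cyl - lo) / (hi - lo)                        # -> [0, 1] per axis
    idx = np.clip((norm * res).astype(int), 0, res - 1)  # per-axis bin index
    r, t, z = idx[:, 0], idx[:, 1], idx[:, 2]
    return (r, t), (t, z), (z, r)

# Example: bin random points within assumed scene bounds.
pts = np.random.uniform(-50, 50, size=(10_000, 3))
cyl = to_cylindrical(pts)
lo = np.array([0.0, -np.pi, -5.0])
hi = np.array([50.0 * np.sqrt(2), np.pi, 5.0])
plane_rt, plane_tz, plane_zr = tpv_plane_indices(cyl, lo, hi)
```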
arXiv Detail & Related papers (2023-08-31T17:57:17Z)
- Unsupervised Multi-view Pedestrian Detection [12.882317991955228]
We propose an Unsupervised Multi-view Pedestrian Detection approach (UMPD) that learns a multi-view pedestrian detector via 2D-3D mapping without requiring annotations.
SIS is proposed to extract unsupervised representations of multi-view images, which are converted into 2D pedestrian masks as pseudo labels.
GVD encodes multi-view 2D images into a 3D volume to predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D mapping.
arXiv Detail & Related papers (2023-05-21T13:27:02Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method that predicts 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- Learning Ego 3D Representation as Ray Tracing [42.400505280851114]
We present a novel end-to-end architecture for ego 3D representation learning from unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation.
We show that our model outperforms all state-of-the-art alternatives significantly.
arXiv Detail & Related papers (2022-06-08T17:55:50Z)
- Voxelized 3D Feature Aggregation for Multiview Detection [15.465855460519446]
We propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection.
Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels.
This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent.
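A simplified sketch of that projection-and-aggregation step (the projection matrix, grid shapes, and the mean pooling along the vertical axis are assumptions for illustration; the paper's actual aggregation may differ):
```python
import torch
import torch.nn.functional as F

def aggregate_voxels(feat_2d, proj, voxel_centers, img_hw):
    """feat_2d: (C, H, W) image features; proj: (3, 4) camera projection matrix;
    voxel_centers: (X, Y, Z, 3) world coordinates; img_hw: (H, W) of feat_2d."""
    X, Y, Z, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3)
    pts_h = torch.cat([pts, torch.ones(len(pts), 1)], dim=-1)  # homogeneous coords
    uvw = pts_h @ proj.t()                                     # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)               # pixel coordinates
    h, w = img_hw
    # Normalize to [-1, 1] and bilinearly sample the 2D feature map per voxel.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat_2d[None], grid.view(1, 1, -1, 2),
                            align_corners=True)                # (1, C, 1, N)
    vox = sampled.view(feat_2d.shape[0], X, Y, Z)              # (C, X, Y, Z)
    # Aggregate along the vertical axis: one feature per ground-plane cell
    # (assumed mean pooling; the paper describes aggregation along vertical lines).
    return vox.mean(dim=-1)                                    # (C, X, Y)
```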
arXiv Detail & Related papers (2021-12-07T03:38:50Z)
- Monocular Road Planar Parallax Estimation [25.36368935789501]
Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving.
We propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences.
RPANet takes a pair of images aligned by the homography of the road plane as input and outputs a $\gamma$ map for 3D reconstruction.
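For context, a small hedged sketch of the road-plane alignment step described above, using OpenCV; the homography H_road is assumed to be available from calibration or plane fitting rather than derived here:
```python
import cv2
import numpy as np

def align_by_road_homography(img_ref, img_src, H_road):
    """Warp img_src into img_ref's view with the 3x3 road-plane homography, so that
    residual motion between the pair encodes only off-plane (parallax) structure."""
    h, w = img_ref.shape[:2]
    warped = cv2.warpPerspective(img_src, H_road, (w, h))
    # Stack the aligned pair channel-wise as a possible network input.
    return np.concatenate([img_ref, warped], axis=-1)
```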
arXiv Detail & Related papers (2021-11-22T10:03:41Z)
- Multi-Plane Program Induction with 3D Box Priors [110.6726150681556]
We present Box Program Induction (BPI), which infers a program-like scene representation from a single image.
BPI simultaneously models repeated structure on multiple 2D planes, the 3D position and orientation of the planes, and camera parameters.
It uses neural networks to infer visual cues such as vanishing points and wireframe lines, which guide a search-based algorithm to find the program that best explains the image.
arXiv Detail & Related papers (2020-11-19T18:07:46Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.