Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
- URL: http://arxiv.org/abs/2302.07817v1
- Date: Wed, 15 Feb 2023 17:58:10 GMT
- Title: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
- Authors: Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu
- Abstract summary: We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes.
We model each point in the 3D space by summing its projected features on the three planes.
Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
- Score: 84.94140661523956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern methods for vision-centric autonomous driving perception widely adopt
the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its
better efficiency than voxel representation, it has difficulty describing the
fine-grained 3D structure of a scene with a single plane. To address this, we
propose a tri-perspective view (TPV) representation which accompanies BEV with
two additional perpendicular planes. We model each point in the 3D space by
summing its projected features on the three planes. To lift image features to
the 3D TPV space, we further propose a transformer-based TPV encoder
(TPVFormer) to obtain the TPV features effectively. We employ the attention
mechanism to aggregate the image features corresponding to each query in each
TPV plane. Experiments show that our model trained with sparse supervision
effectively predicts the semantic occupancy for all voxels. We demonstrate for
the first time that using only camera inputs can achieve comparable performance
with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code:
https://github.com/wzzheng/TPVFormer.
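To make the core idea concrete, below is a minimal sketch (not the authors' released code) of how a 3D point's feature could be assembled by bilinearly sampling three orthogonal TPV planes and summing the results. The plane naming, shapes, and helper functions are illustrative assumptions, not the paper's exact implementation.
```python
import torch
import torch.nn.functional as F

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at normalized coords u, v in [0, 1]."""
    grid = torch.stack([u, v], dim=-1) * 2.0 - 1.0   # grid_sample expects [-1, 1]
    grid = grid.view(1, 1, -1, 2)                    # (1, 1, P, 2)
    feats = F.grid_sample(plane.unsqueeze(0), grid,
                          mode="bilinear", align_corners=False)  # (1, C, 1, P)
    return feats.squeeze(0).squeeze(1).t()           # (P, C)

def tpv_point_features(points, tpv_hw, tpv_dh, tpv_wd, scene_range):
    """points: (P, 3) xyz; tpv_*: three (C, ., .) planes; scene_range: (3, 2) min/max per axis."""
    lo, hi = scene_range[:, 0], scene_range[:, 1]
    x, y, z = ((points - lo) / (hi - lo)).unbind(dim=-1)   # normalize xyz to [0, 1]
    # Project each point onto the three planes and sum the sampled features.
    f_hw = sample_plane(tpv_hw, x, y)   # top (BEV-like) plane
    f_dh = sample_plane(tpv_dh, y, z)   # side plane
    f_wd = sample_plane(tpv_wd, z, x)   # front plane
    return f_hw + f_dh + f_wd           # (P, C)

# Example query with random planes, points, and assumed scene bounds.
planes = [torch.randn(64, 100, 100) for _ in range(3)]
pts = torch.rand(1000, 3) * 50.0
rng = torch.tensor([[0.0, 50.0], [0.0, 50.0], [0.0, 50.0]])
feat = tpv_point_features(pts, *planes, rng)   # -> (1000, 64)
```
Summing (rather than concatenating) keeps the per-point cost fixed regardless of the number of planes, which fits the abstract's efficiency argument against a full voxel grid; the per-query attention lifting in TPVFormer is not reproduced in this sketch.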
Related papers
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction [72.75478398447396]
We propose a cylindrical tri-perspective view to represent point clouds effectively and comprehensively.
Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in the cylindrical coordinate system.
We employ spatial group pooling to maintain structural details during projection and adopt 2D backbones to efficiently process each TPV plane.
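As a rough illustration of the cylindrical construction (the bounds, resolutions, and binning scheme below are placeholders, not values from the paper):
```python
import numpy as np

def to_cylindrical(xyz):
    """xyz: (N, 3) Cartesian LiDAR points -> (N, 3) (rho, theta, z)."""
    rho = np.hypot(xyz[:, 0], xyz[:, 1])        # radial distance from the sensor
    theta = np.arctan2(xyz[:, 1], xyz[:, 0])    # azimuth in [-pi, pi]
    return np.stack([rho, theta, xyz[:, 2]], axis=-1)

def tpv_plane_indices(cyl, lo, hi, resolution=(128, 128, 32)):
    """Discretize (rho, theta, z) and return per-point indices into the three
    cylindrical TPV planes: (rho, theta), (theta, z), and (z, rho)."""
    res = np.array(resolution)
    norm = (cyl - lo) / (hi - lo)                        # -> [0, 1] per axis
    idx = np.clip((norm * res).astype(int), 0, res - 1)  # per-axis bin index
    r, t, z = idx[:, 0], idx[:, 1], idx[:, 2]
    return (r, t), (t, z), (z, r)

# Example: bin random points within assumed scene bounds.
pts = np.random.uniform(-50, 50, size=(10_000, 3))
cyl = to_cylindrical(pts)
lo = np.array([0.0, -np.pi, -5.0])
hi = np.array([50.0 * np.sqrt(2), np.pi, 5.0])
plane_rt, plane_tz, plane_zr = tpv_plane_indices(cyl, lo, hi)
```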
arXiv Detail & Related papers (2023-08-31T17:57:17Z)
- Unsupervised Multi-view Pedestrian Detection [12.882317991955228]
We propose an Unsupervised Multi-view Pedestrian Detection approach (UMPD) that learns a multi-view pedestrian detector via 2D-3D mapping without requiring annotations.
SIS is proposed to extract unsupervised representations of multi-view images, which are converted into 2D pedestrian masks as pseudo labels.
GVD encodes multi-view 2D images into a 3D volume to predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D mapping.
arXiv Detail & Related papers (2023-05-21T13:27:02Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method that predicts 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- Learning Ego 3D Representation as Ray Tracing [42.400505280851114]
We present a novel end-to-end architecture for ego 3D representation learning from unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation.
We show that our model outperforms all state-of-the-art alternatives significantly.
arXiv Detail & Related papers (2022-06-08T17:55:50Z)
- Voxelized 3D Feature Aggregation for Multiview Detection [15.465855460519446]
We propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection.
Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels.
This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent.
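A simplified sketch of that projection-and-aggregation step (the projection matrix, grid shapes, and the mean pooling along the vertical axis are assumptions for illustration; the paper's actual aggregation may differ):
```python
import torch
import torch.nn.functional as F

def aggregate_voxels(feat_2d, proj, voxel_centers, img_hw):
    """feat_2d: (C, H, W) image features; proj: (3, 4) camera projection matrix;
    voxel_centers: (X, Y, Z, 3) world coordinates; img_hw: (H, W) of feat_2d."""
    X, Y, Z, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3)
    pts_h = torch.cat([pts, torch.ones(len(pts), 1)], dim=-1)  # homogeneous coords
    uvw = pts_h @ proj.t()                                     # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)               # pixel coordinates
    h, w = img_hw
    # Normalize to [-1, 1] and bilinearly sample the 2D feature map per voxel.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat_2d[None], grid.view(1, 1, -1, 2),
                            align_corners=True)                # (1, C, 1, N)
    vox = sampled.view(feat_2d.shape[0], X, Y, Z)              # (C, X, Y, Z)
    # Aggregate along the vertical axis: one feature per ground-plane cell
    # (assumed mean pooling; the paper describes aggregation along vertical lines).
    return vox.mean(dim=-1)                                    # (C, X, Y)
```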
arXiv Detail & Related papers (2021-12-07T03:38:50Z)
- Monocular Road Planar Parallax Estimation [25.36368935789501]
Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving.
We propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences.
RPANet takes a pair of images aligned by the homography of the road plane as input and outputs a $\gamma$ map for 3D reconstruction.
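For context, a small hedged sketch of the road-plane alignment step described above, using OpenCV; the homography H_road is assumed to be available from calibration or plane fitting rather than derived here:
```python
import cv2
import numpy as np

def align_by_road_homography(img_ref, img_src, H_road):
    """Warp img_src into img_ref's view with the 3x3 road-plane homography, so that
    residual motion between the pair encodes only off-plane (parallax) structure."""
    h, w = img_ref.shape[:2]
    warped = cv2.warpPerspective(img_src, H_road, (w, h))
    # Stack the aligned pair channel-wise as a possible network input.
    return np.concatenate([img_ref, warped], axis=-1)
```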
arXiv Detail & Related papers (2021-11-22T10:03:41Z)
- Multi-Plane Program Induction with 3D Box Priors [110.6726150681556]
We present Box Program Induction (BPI), which infers a program-like scene representation from a single image.
BPI simultaneously models repeated structure on multiple 2D planes, the 3D position and orientation of the planes, and camera parameters.
It uses neural networks to infer visual cues such as vanishing points and wireframe lines, which guide a search-based algorithm to find the program that best explains the image.
arXiv Detail & Related papers (2020-11-19T18:07:46Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.