Related papers: Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction

URL: http://arxiv.org/abs/2401.13785v2
Date: Thu, 4 Apr 2024 13:52:17 GMT
Title: Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction
Authors: Sathira Silva, Savindu Bhashitha Wannigama, Gihan Jayatilaka, Muhammad Haris Khan, Roshan Ragel
Abstract summary: This study introduces S2TPVFormer, a spatiotemporal transformer architecture for temporally coherent 3D semantic occupancy prediction. We enrich the prior process by including temporal cues using a novel temporal cross-view hybrid attention mechanism. Experimental evaluations demonstrate a substantial 4.1% improvement in mean Intersection over Union for 3D Semantic Occupancy.
Score: 6.527178779672975
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Holistic understanding and reasoning in 3D scenes play a vital role in the success of autonomous driving systems. The evolution of 3D semantic occupancy prediction as a pretraining task for autonomous driving and robotic downstream tasks captures finer 3D details compared to methods like 3D detection. Existing approaches predominantly focus on spatial cues such as tri-perspective view embeddings (TPV), often overlooking temporal cues. This study introduces a spatiotemporal transformer architecture S2TPVFormer for temporally coherent 3D semantic occupancy prediction. We enrich the prior process by including temporal cues using a novel temporal cross-view hybrid attention mechanism (TCVHA) and generate spatiotemporal TPV embeddings (i.e. S2TPV embeddings). Experimental evaluations on the nuScenes dataset demonstrate a substantial 4.1% improvement in mean Intersection over Union (mIoU) for 3D Semantic Occupancy compared to TPVFormer, confirming the effectiveness of the proposed S2TPVFormer in enhancing 3D scene perception.
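The abstract reports its gain in mean Intersection over Union (mIoU). As a rough illustration of how such a metric is commonly computed over per-voxel class labels, here is a minimal pure-Python sketch. The function name `mean_iou`, the flat label lists, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for this example; the benchmark's actual evaluation protocol (ignored classes, masking of unobserved voxels) may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU over classes present in pred or target.

    Hypothetical helper for illustration only, not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Voxels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Voxels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six voxels.
pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 1]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 4))
```

In this toy case the per-class IoUs are 1, 1/3 and 1/2, so the mean is about 0.611.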

Related papers

Real-Time 3D Object Detection with Inference-Aligned Learning [20.94871746774727]
Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection framework to bridge the gap between how detectors are trained and how they are evaluated.
arXiv Detail & Related papers (2025-11-20T08:27:00Z)
One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion [3.664655957801223]
In real-world traffic scenarios, a significant portion of a visual 3D scene remains occluded or outside the camera's field of view. We propose Creating the Future SSC, a novel temporal SSC framework that leverages pseudo-future frame prediction to expand the model's effective perceptual range. Our approach combines poses and depths to establish accurate 3D correspondences, enabling geometrically-consistent fusion of past, present, and predicted future frames in 3D space.
arXiv Detail & Related papers (2025-07-18T10:24:58Z)
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation [59.33052312107478]
Event cameras offer possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models non-uniform trajectories via event-guided parametric curves. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, flows and depth motion fields.
arXiv Detail & Related papers (2025-03-14T13:15:54Z)
H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision [41.529084775662355]
We present a novel 3D occupancy prediction approach, H3O, which features highly efficient architecture designs that incur a significantly lower computational cost as compared to the current state-of-the-art methods. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels.
arXiv Detail & Related papers (2025-03-06T03:27:14Z)
VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving [44.91443640710085]
VisionPAD is a novel self-supervised pre-training paradigm for vision-centric algorithms in autonomous driving. It reconstructs multi-view representations using only images as supervision. It significantly improves performance in 3D object detection, occupancy prediction and map segmentation.
arXiv Detail & Related papers (2024-11-22T03:59:41Z)
FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View [46.81548000021799]
In autonomous driving, 3D occupancy prediction outputs voxel-wise status and semantic labels for more comprehensive understandings of 3D scenes. Recent researchers have extensively explored various aspects of this task, including view transformation techniques, ground-truth label generation, and elaborate network design. A new method, dubbed FastOcc, is proposed to accelerate the model while keeping its accuracy. Experiments on the Occ3D-nuScenes benchmark demonstrate that our FastOcc achieves a fast inference speed.
arXiv Detail & Related papers (2024-03-05T07:01:53Z)
OccFlowNet: Towards Self-supervised Occupancy Estimation via Differentiable Rendering and Occupancy Flow [0.6577148087211809]
We present a novel approach to occupancy estimation inspired by neural radiance field (NeRF) using only 2D labels. We employ differentiable volumetric rendering to predict depth and semantic maps and train a 3D network based on 2D supervision only.
arXiv Detail & Related papers (2024-02-20T08:04:12Z)
Visual Point Cloud Forecasting enables Scalable Autonomous Driving [28.376086570498952]
Visual autonomous driving applications require features encompassing semantics, 3D geometry, and temporal information simultaneously. We present ViDAR, a general model to pre-train downstream visual encoders. Experiments show significant gain in downstream tasks, e.g., 3.1% NDS on 3D detection, 10% error reduction on motion forecasting, and 15% less collision rate on planning.
arXiv Detail & Related papers (2023-12-29T15:44:13Z)
RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting [79.34357055254239]
Hand trajectory forecasting is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems. Existing methods handle this problem in a 2D image space which is inadequate for 3D real-world applications. We set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view.
arXiv Detail & Related papers (2023-07-17T04:55:02Z)
Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction [84.94140661523956]
We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
arXiv Detail & Related papers (2023-02-15T17:58:10Z)
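The TPV summary above describes modeling each 3D point by summing its projected features on three perpendicular planes. A minimal sketch of that aggregation, assuming integer voxel indices `(h, w, d)` and planes stored as nested lists indexed `hw[h][w]`, `dh[d][h]` and `wd[w][d]`, could look as follows. The names `make_plane` and `tpv_point_feature` are hypothetical, and this simplified version omits the interpolation a real model would use for continuous coordinates.

```python
from typing import List

# A plane maps a 2D cell (i, j) to a feature vector of length C.
Plane = List[List[List[float]]]


def make_plane(rows: int, cols: int, channels: int, fill: float) -> Plane:
    """Build a rows x cols plane whose cells all hold the same feature vector."""
    return [[[fill] * channels for _ in range(cols)] for _ in range(rows)]


def tpv_point_feature(hw: Plane, dh: Plane, wd: Plane,
                      h: int, w: int, d: int) -> List[float]:
    """Sum the features of voxel (h, w, d) projected onto the three planes."""
    f_hw = hw[h][w]  # projection onto the top (BEV-like) plane
    f_dh = dh[d][h]  # projection onto the first perpendicular plane
    f_wd = wd[w][d]  # projection onto the second perpendicular plane
    return [a + b + c for a, b, c in zip(f_hw, f_dh, f_wd)]


# Toy grid: H=2, W=3, D=2, with 2 feature channels per cell.
H, W, D, C = 2, 3, 2, 2
hw = make_plane(H, W, C, 1.0)
dh = make_plane(D, H, C, 10.0)
wd = make_plane(W, D, C, 100.0)

print(tpv_point_feature(hw, dh, wd, h=1, w=2, d=0))  # [111.0, 111.0]
```

The point of the design is storage cost: three planes need on the order of HW + DH + WD cells, compared with H x W x D for a dense voxel grid, while every voxel still receives its own combined feature.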
ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. Much of the recent attention has shifted towards semi and (or) weakly supervised learning. We propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
arXiv Detail & Related papers (2022-03-29T19:11:54Z)
Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning framework, capable of learning from unlabeled tasks. Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data. STRL takes two temporally-related frames from a 3D point cloud sequence as the input, transforms it with the spatial data augmentation, and learns the invariant representation self-supervisedly.
arXiv Detail & Related papers (2021-09-01T04:17:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.