Related papers: TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement

URL: http://arxiv.org/abs/2410.11228v1
Date: Tue, 15 Oct 2024 03:20:48 GMT
Title: TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement
Authors: Zhiwei Lin, Hongbo Jin, Yongtao Wang, Yufei Wei, Nan Dong
Abstract summary: We propose a radar-camera multi-modal temporal enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of utilizing temporal information in 3D object detection. Experimental results demonstrate that TEOcc achieves state-of-the-art occupancy prediction on nuScenes benchmarks.
Score: 5.860326420490923
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy prediction methods mainly focus on designing better occupancy representations, such as tri-perspective view or neural radiance fields, while ignoring the advantages of using long-temporal information. In this paper, we propose a radar-camera multi-modal temporal enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of utilizing temporal information in 3D object detection. Specifically, we introduce a temporal enhancement branch to learn temporal occupancy prediction. In this branch, we randomly discard the t-k input frame of the multi-view camera and predict its 3D occupancy by long-term and short-term temporal decoders separately with the information from other adjacent frames and multi-modal inputs. Besides, to reduce computational costs and incorporate multi-modal inputs, we specially design 3D convolutional layers for long-term and short-term temporal decoders. Furthermore, since the lightweight occupancy prediction head is a dense classification head, we propose to use a shared occupancy prediction head for the temporal enhancement and main branches. It is worth noting that the temporal enhancement branch is only performed during training and is discarded during inference. Experimental results demonstrate that TEOcc achieves state-of-the-art occupancy prediction on nuScenes benchmarks. In addition, the proposed temporal enhancement branch is a plug-and-play module that can be easily integrated into existing occupancy prediction methods to improve the performance of occupancy prediction. The code and models will be released at https://github.com/VDIGPKU/TEOcc.
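
Illustrative sketch (not the authors' code): the training-only temporal enhancement branch described in the abstract can be pictured roughly as below. Everything here is an assumption made for illustration: the function names, the tensor shapes, the use of NumPy, and in particular the neighbor-averaging stand-in, which replaces the learned long-term and short-term 3D convolutional decoders. Only the overall flow follows the abstract: drop a random past frame, predict it from the remaining frames at two temporal ranges, score both branches with one shared head, and skip the extra branch at inference.

```python
import numpy as np

def drop_random_frame(frames, rng):
    """Zero out one randomly chosen past frame; return (masked_frames, dropped_index)."""
    num_frames = frames.shape[0]
    # The last index is the current frame t; only earlier frames (t-k, k >= 1) may be dropped.
    dropped = int(rng.integers(0, num_frames - 1))
    masked = frames.copy()
    masked[dropped] = 0.0
    return masked, dropped

def reconstruct_from_neighbors(masked, dropped, window):
    """Hypothetical stand-in for a temporal decoder: average frames within `window` steps."""
    lo = max(0, dropped - window)
    hi = min(masked.shape[0], dropped + window + 1)
    neighbors = [masked[i] for i in range(lo, hi) if i != dropped]
    return np.mean(neighbors, axis=0)

def shared_head(features, weights):
    """Dense per-voxel classifier used by both branches: (C, X, Y, Z) -> (classes, X, Y, Z)."""
    return np.einsum("kc,cxyz->kxyz", weights, features)

def forward(frames, weights, rng, training):
    """Main branch always runs; the temporal enhancement branch runs only in training."""
    main_logits = shared_head(frames[-1], weights)
    if not training:
        return main_logits, None
    masked, dropped = drop_random_frame(frames, rng)
    short_term = reconstruct_from_neighbors(masked, dropped, window=1)
    long_term = reconstruct_from_neighbors(masked, dropped, window=frames.shape[0])
    aux_logits = [shared_head(short_term, weights), shared_head(long_term, weights)]
    return main_logits, aux_logits
```

Calling forward with training=False returns only the main-branch logits, which mirrors the abstract's statement that the temporal enhancement branch is discarded during inference.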

Related papers

LMPOcc: 3D Semantic Occupancy Prediction Utilizing Long-Term Memory Prior from Historical Traversals [4.970345700893879]
Longterm Memory Prior Occupancy (LMPOcc) is the first 3D occupancy prediction methodology that exploits long-term memory priors derived from historical perceptual outputs. We introduce a plug-and-play architecture that integrates long-term memory priors to enhance local perception while simultaneously constructing global occupancy representations.
arXiv Detail & Related papers (2025-04-18T09:58:48Z)
Tracking Meets Large Multimodal Models for Driving Scenario Understanding [76.71815464110153]
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details. We introduce a novel approach for embedding this tracking information into LMMs to enhance their understanding of driving scenarios.
arXiv Detail & Related papers (2025-03-18T17:59:12Z)
Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection [9.053936905556204]
We propose a model called DAP (Detection After Prediction), consisting of a two-branch network. The features predicting the current objects from branch (i) are fused into branch (ii) to transfer predictive knowledge. Our model can be used plug-and-play, showing consistent performance gain.
arXiv Detail & Related papers (2024-04-02T02:20:47Z)
OccFlowNet: Towards Self-supervised Occupancy Estimation via Differentiable Rendering and Occupancy Flow [0.6577148087211809]
We present a novel approach to occupancy estimation inspired by neural radiance field (NeRF) using only 2D labels. We employ differentiable volumetric rendering to predict depth and semantic maps and train a 3D network based on 2D supervision only.
arXiv Detail & Related papers (2024-02-20T08:04:12Z)
A Spatiotemporal Approach to Tri-Perspective Representation for 3D Semantic Occupancy Prediction [6.527178779672975]
Vision-based 3D semantic occupancy prediction is increasingly overlooked in favor of LiDAR-based approaches. This study introduces S2TPVFormer, a transformer architecture designed to predict temporally coherent 3D semantic occupancy.
arXiv Detail & Related papers (2024-01-24T20:06:59Z)
Visual Point Cloud Forecasting enables Scalable Autonomous Driving [28.376086570498952]
Visual autonomous driving applications require features encompassing semantics, 3D geometry, and temporal information simultaneously. We present ViDAR, a general model to pre-train downstream visual encoders. Experiments show significant gain in downstream tasks, e.g., 3.1% NDS on 3D detection, 10% error reduction on motion forecasting, and 15% lower collision rate on planning.
arXiv Detail & Related papers (2023-12-29T15:44:13Z)
OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose an OccNeRF method for training occupancy networks without 3D supervision. We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range. For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving [68.95178518732965]
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing works either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network.
arXiv Detail & Related papers (2023-08-02T23:39:24Z)
ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
SLPC: a VRNN-based approach for stochastic lidar prediction and completion in autonomous driving [63.87272273293804]
We propose a new LiDAR prediction framework that is based on generative models, namely Variational Recurrent Neural Networks (VRNNs). Our algorithm is able to address the limitations of previous video prediction frameworks when dealing with sparse data by spatially inpainting the depth maps in the upcoming frames. We present a sparse version of VRNNs and an effective self-supervised training method that does not require any labels.
arXiv Detail & Related papers (2021-02-19T11:56:44Z)
Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel temporal-temporal convolution block that is capable of extracting at multiple resolutions. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
A Spatio-temporal Transformer for 3D Human Motion Prediction [39.31212055504893]
We propose a Transformer-based architecture for the task of generative modelling of 3D human motion. We empirically show that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models.
arXiv Detail & Related papers (2020-04-18T19:49:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.