Temporal Enhanced Training of Multi-view 3D Object Detector via
Historical Object Prediction
- URL: http://arxiv.org/abs/2304.00967v1
- Date: Mon, 3 Apr 2023 13:35:29 GMT
- Authors: Zhuofan Zong, Dongzhi Jiang, Guanglu Song, Zeyue Xue, Jingyong Su,
Hongsheng Li, Yu Liu
- Abstract summary: We propose a new paradigm, named Historical Object Prediction (HoP) for multi-view 3D detection.
We generate a pseudo Bird's-Eye View (BEV) feature of timestamp t-k from its adjacent frames and utilize this feature to predict the object set at timestamp t-k.
As a plug-and-play approach, HoP can be easily incorporated into state-of-the-art BEV detection frameworks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new paradigm, named Historical Object Prediction
(HoP) for multi-view 3D detection to leverage temporal information more
effectively. The HoP approach is straightforward: given the current timestamp
t, we generate a pseudo Bird's-Eye View (BEV) feature of timestamp t-k from its
adjacent frames and utilize this feature to predict the object set at timestamp
t-k. Our approach is motivated by the observation that forcing the detector
to capture both the spatial location and temporal motion of objects at
historical timestamps can lead to more accurate BEV feature learning. First,
we elaborately design short-term and long-term temporal decoders, which can
generate the pseudo BEV feature for timestamp t-k without the involvement of
its corresponding camera images. Second, an additional object decoder is
flexibly attached to predict the object targets using the generated pseudo BEV
feature. Note that we perform HoP only during training, so the proposed
method introduces no extra overhead during inference. As a plug-and-play
approach, HoP can be easily incorporated into state-of-the-art BEV detection
frameworks, including BEVFormer and BEVDet series. Furthermore, the auxiliary
HoP approach is complementary to prevalent temporal modeling methods, leading
to significant performance gains. Extensive experiments are conducted to
evaluate the effectiveness of the proposed HoP on the nuScenes dataset. We
choose representative methods, BEVFormer and BEVDet4D-Depth, to evaluate our
approach. Surprisingly, HoP achieves 68.5% NDS and 62.4% mAP with ViT-L on the
nuScenes test set, outperforming all 3D object detectors on the leaderboard.
Code will be available at https://github.com/Sense-X/HoP.
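The training-only auxiliary objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's actual architecture: the averaging "temporal decoder", the MSE "detection loss", and all function names are assumptions standing in for HoP's learned short/long-term temporal decoders and detection head.

```python
import numpy as np

def temporal_decoder(bev_history, k):
    """Illustrative stand-in for HoP's short/long-term temporal decoders:
    build a pseudo BEV feature for timestamp t-k from the *other* frames
    in the history, without using the real frame (or images) at t-k."""
    others = [f for i, f in enumerate(bev_history) if i != k]
    return np.mean(others, axis=0)  # naive average as a placeholder

def hop_auxiliary_loss(bev_history, gt_objects_tk, object_decoder, k=1):
    """Training-only auxiliary loss: predict the historical object set at
    t-k from the pseudo BEV feature and score it against the historical
    ground truth. At inference this branch is dropped, so no overhead."""
    pseudo_bev = temporal_decoder(bev_history, k)
    preds = object_decoder(pseudo_bev)
    return float(np.mean((preds - gt_objects_tk) ** 2))  # placeholder loss
```

Here `bev_history[i]` holds the BEV feature for timestamp t-i; a real implementation would use learned attention-based decoders and a set-based detection loss rather than an average and MSE, but the control flow (reconstruct t-k, predict its objects, supervise only during training) is the same.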
Related papers
- Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection [9.053936905556204]
We propose a model called DAP (Detection After Prediction), consisting of a two-branch network.
The features predicting the current objects from branch (i) are fused into branch (ii) to transfer predictive knowledge.
Our model can be used plug-and-play, showing consistent performance gains.
arXiv Detail & Related papers (2024-04-02T02:20:47Z) - PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection [66.94819989912823]
We propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection.
We use point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement.
We conduct extensive experiments on a large-scale dataset to demonstrate that our approach performs well against state-of-the-art methods.
arXiv Detail & Related papers (2023-12-13T18:59:13Z) - Instance-aware Multi-Camera 3D Object Detection with Structural Priors
Mining and Self-Boosting Learning [93.71280187657831]
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z) - OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose OCBEV, an object-centric query-based BEV detector that can capture the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z) - 3D Video Object Detection with Learnable Object-Centric Global
Optimization [65.68977894460222]
Correspondence-based optimization is the cornerstone for 3D scene reconstruction but is less studied in 3D video object detection.
We propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment.
arXiv Detail & Related papers (2023-03-27T17:39:39Z) - OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for
Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z) - BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud
Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder learning feature representation.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z) - MGTANet: Encoding Sequential LiDAR Points Using Long Short-Term
Motion-Guided Temporal Attention for 3D Object Detection [8.305942415868042]
Most LiDAR sensors generate a sequence of point clouds in real-time.
Recent studies have revealed that substantial performance improvement can be achieved by exploiting the context present in a sequence of point sets.
We propose a novel 3D object detection architecture, which can encode point cloud sequences acquired by multiple successive scans.
arXiv Detail & Related papers (2022-12-01T11:24:47Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.