BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
- URL: http://arxiv.org/abs/2203.17054v1
- Date: Thu, 31 Mar 2022 14:21:19 GMT
- Title: BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
- Authors: Junjie Huang, Guan Huang
- Abstract summary: BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space.
We simplify the velocity learning task by removing the factors of ego-motion and time, which equips BEVDet4D with robust generalization performance.
On the challenging nuScenes benchmark, we report a new record of 51.5% NDS with the high-performance configuration dubbed BEVDet4D-Base.
- Score: 14.11339105810819
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Single-frame data contains finite information, which limits the
performance of existing vision-based multi-camera 3D object detection
paradigms. To fundamentally push the performance boundary in this area,
BEVDet4D is proposed to lift the scalable BEVDet paradigm from the
spatial-only 3D space to the spatial-temporal 4D space. We upgrade the
framework with only a few modifications, fusing the feature from the previous
frame with the corresponding one in the current frame. In this way, with a
negligible extra computing budget, we enable the algorithm to access temporal
cues by querying and comparing the two candidate features. Beyond this, we
also simplify the velocity learning task by removing the factors of ego-motion
and time, which equips BEVDet4D with robust generalization performance and
reduces the velocity error by 52.8%. This makes vision-based methods, for the
first time, comparable with those relying on LiDAR or radar in this respect.
On the challenging nuScenes benchmark, we report a new record of 51.5% NDS
with the high-performance configuration dubbed BEVDet4D-Base, which surpasses
the previous leading method BEVDet by +4.3% NDS.
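The abstract describes two mechanisms: warping the previous frame's BEV feature into the current ego frame before fusing it with the current feature, and regressing a position offset between the aligned frames rather than a metric velocity. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the function names, the square BEV grid with half-extent bev_range, and the 2D affine ego_motion input are all assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' code): align the previous frame's
# BEV feature to the current ego frame, then fuse by channel concatenation.
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, ego_motion, bev_range=51.2):
    """Resample the previous BEV feature map into the current ego frame.

    prev_bev:   (B, C, H, W) BEV features from the previous frame.
    ego_motion: (B, 2, 3) 2D affine transform (rotation + translation in
                metres) mapping current-frame BEV coordinates to
                previous-frame BEV coordinates.
    bev_range:  half-extent of the BEV grid in metres (assumed value).
    """
    theta = ego_motion.clone()
    # affine_grid works in normalized [-1, 1] coordinates, so the metric
    # translation must be rescaled by the BEV half-extent.
    theta[:, :, 2] = theta[:, :, 2] / bev_range
    grid = F.affine_grid(theta, prev_bev.size(), align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)

def fuse_bev(curr_bev, prev_bev, ego_motion):
    """Concatenate current features with the ego-aligned previous ones."""
    aligned = warp_prev_bev(prev_bev, ego_motion)
    return torch.cat([curr_bev, aligned], dim=1)  # (B, 2C, H, W)

def position_offset_target(pos_curr, pos_prev_aligned):
    """Simplified regression target: with both frames expressed in the same
    ego frame, predict the object's displacement between frames instead of
    a metric velocity, dropping ego-motion and the frame interval."""
    return pos_curr - pos_prev_aligned  # metres per frame pair
```

Because both feature maps live in the same ego frame after warping, the detection head can regress an object's inter-frame displacement directly, which is one way to read the abstract's claim of removing ego-motion and time from the velocity learning task.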
Related papers
- RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar [15.776076554141687]
The 3D occupancy-based perception pipeline has significantly advanced autonomous driving.
Current methods rely on LiDAR or camera inputs for 3D occupancy prediction.
We introduce a novel approach that utilizes 4D imaging radar sensors for 3D occupancy prediction.
arXiv Detail & Related papers (2024-05-22T21:48:17Z)
- Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond.
We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner.
Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z)
- 4DRVO-Net: Deep 4D Radar-Visual Odometry Using Multi-Modal and Multi-Scale Adaptive Fusion [2.911052912709637]
Four-dimensional (4D) radar-visual odometry (4DRVO) integrates complementary information from 4D radar and cameras.
4DRVO may exhibit significant tracking errors owing to the sparsity of 4D radar point clouds.
We present 4DRVO-Net, a method for 4D radar-visual odometry.
arXiv Detail & Related papers (2023-08-12T14:00:09Z)
- SMURF: Spatial Multi-Representation Fusion for 3D Object Detection with 4D Imaging Radar [12.842457981088378]
This paper introduces spatial multi-representation fusion (SMURF), a novel approach to 3D object detection using a single 4D imaging radar.
SMURF mitigates measurement inaccuracy caused by limited angular resolution and multi-path propagation of radar signals.
Experimental evaluations on View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness and generalization ability of SMURF.
arXiv Detail & Related papers (2023-07-20T11:33:46Z)
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
- Recurrent Vision Transformers for Object Detection with Event Cameras [62.27246562304705]
We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras.
RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection.
Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
arXiv Detail & Related papers (2022-12-11T20:28:59Z)
- NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields [99.57774680640581]
We present an efficient framework capable of fast reconstruction, compact modeling, and streamable rendering.
We propose to decompose the 4D space according to temporal characteristics. Points in the 4D space are associated with probabilities belonging to three categories: static, deforming, and new areas.
arXiv Detail & Related papers (2022-10-28T07:11:05Z)
- Learning Spatial and Temporal Variations for 4D Point Cloud Segmentation [0.39373541926236766]
We argue that the temporal information across the frames provides crucial knowledge for 3D scene perception.
We design a temporal variation-aware module and a temporal voxel-point refiner to capture the temporal variation in the 4D point cloud.
arXiv Detail & Related papers (2022-07-11T07:36:26Z)
- LiDAR-based 4D Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm.
We extend DS-Net to 4D panoptic LiDAR segmentation via temporally unified instance clustering on aligned LiDAR frames.
Our proposed DS-Net achieves superior accuracy over current state-of-the-art methods in both tasks.
arXiv Detail & Related papers (2022-03-14T15:25:42Z)
- Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving [74.74519047735916]
3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in several respects.
Data collected for other use cases (such as virtual reality, gaming, and animation) may not be usable for AV applications.
We propose one of the first approaches to alleviate this problem in the AV setting.
arXiv Detail & Related papers (2021-12-22T18:57:16Z)
- BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View [15.560366079077449]
We contribute the BEVDet paradigm for pushing the performance boundary in the multi-camera 3D object detection task.
BEVDet is developed by following the principle of detecting 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed.
The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance.
arXiv Detail & Related papers (2021-12-22T10:48:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.