MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection
- URL: http://arxiv.org/abs/2411.13628v1
- Date: Wed, 20 Nov 2024 14:47:18 GMT
- Title: MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection
- Authors: Tong Ning, Ke Lu, Xirui Jiang, Jian Xue,
- Abstract summary: We propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space.
On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable result in the 3D object detection task.
- Score: 18.13821223763173
- License:
- Abstract: Utilizing temporal information to improve the performance of 3D detection has made great progress recently in the field of autonomous driving. Traditional transformer-based temporal fusion methods suffer from quadratic computational cost and information decay as the length of the frame sequence increases. In this paper, we propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space. Moreover, we design a Motion Elimination module to remove the relatively static objects for temporal fusion. On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable result in the 3D object detection task, exhibiting state-of-the-art performance among existing temporal fusion methods.
Related papers
- Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences [25.74000325019015]
We introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information.
We have conducted experiments on the aggregation and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance.
arXiv Detail & Related papers (2024-09-06T16:29:04Z) - PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection [66.94819989912823]
We propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection.
We use point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement.
We conduct extensive experiments on the large-scale dataset to demonstrate that our approach performs well against state-of-the-art methods.
arXiv Detail & Related papers (2023-12-13T18:59:13Z) - LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection [40.267769862404684]
We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds.
Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector.
arXiv Detail & Related papers (2023-09-28T21:58:25Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
arXiv Detail & Related papers (2022-12-15T14:18:47Z) - Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
gait recognition in the wild is a more practical problem that has attracted the attention of the community of multimedia and computer vision.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.