DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos
- URL: http://arxiv.org/abs/2506.10242v1
- Date: Wed, 11 Jun 2025 23:49:56 GMT
- Title: DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos
- Authors: Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli
- Abstract summary: Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. We propose DySS, a novel method that employs state-space learning and dynamic queries. Our proposed DySS achieves both superior detection performance and efficient inference.
- Score: 53.52664872583893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection; however, they still require a large number of queries and become expensive to run as more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. To encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary future-prediction and masked-reconstruction tasks for training the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
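The abstract describes two mechanisms: a sequential state-space scan that summarizes per-frame features, and a merge/remove/split update that keeps the query set lean. A minimal NumPy sketch of both ideas is below; the linear SSM parameterization, the cosine-similarity merge rule, and all thresholds are illustrative assumptions, not the paper's actual design (a split step for ambiguous queries is omitted for brevity):

```python
import numpy as np

def ssm_scan(features, A, B, C):
    # Fold per-frame feature vectors into a running state via a generic
    # linear recurrence s_t = A s_{t-1} + B x_t, then read out y = C s_T.
    # (Illustrative only; DySS's exact SSM parameterization differs.)
    state = np.zeros(A.shape[0])
    for x_t in features:
        state = A @ state + B @ x_t
    return C @ state

def cosine(u, v):
    # Cosine similarity between two query vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def update_queries(queries, scores, keep_thresh=0.1, merge_thresh=0.95):
    # Hypothetical merge/remove pass over detection queries:
    # drop low-confidence queries, then average near-duplicate neighbors.
    kept = [q for q, s in zip(queries, scores) if s >= keep_thresh]  # remove
    out = []
    for q in kept:
        if out and cosine(out[-1], q) > merge_thresh:  # merge near-duplicates
            out[-1] = 0.5 * (out[-1] + q)
        else:
            out.append(q)
    return out
```

In this sketch the SSM state plays the role of the "informative yet efficient summarization" the abstract mentions, and the query update shrinks the working set before the next decoder stage.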
Related papers
- State Space Model Meets Transformer: A New Paradigm for 3D Object Detection [33.49952392298874]
We propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 and SUN RGB-D datasets.
arXiv Detail & Related papers (2025-03-18T17:58:03Z)
- S2-Track: A Simple yet Strong Approach for End-to-End 3D Multi-Object Tracking [38.63155724204429]
3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods are still in the early stages of development and lack systematic improvements.
arXiv Detail & Related papers (2024-06-04T09:34:46Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection [20.161887223481994]
We propose a long-sequence modeling framework, named StreamPETR, for multi-view 3D object detection.
StreamPETR achieves significant performance improvements only with negligible cost, compared to the single-frame baseline.
The lightweight version realizes 45.0% mAP and 31.7 FPS, outperforming the state-of-the-art method (SOLOFusion) by 2.3% mAP while running 1.8x faster.
arXiv Detail & Related papers (2023-03-21T15:19:20Z)
- DBQ-SSD: Dynamic Ball Query for Efficient 3D Object Detection [113.5418064456229]
We propose a Dynamic Ball Query (DBQ) network to adaptively select a subset of input points according to the input features.
It can be embedded into some state-of-the-art 3D detectors and trained in an end-to-end manner, which significantly reduces the computational cost.
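As a rough illustration of the idea in this entry (adaptively selecting a subset of input points based on their features), the sketch below scores points with a hypothetical linear gate and keeps only a fixed budget, so downstream grouping runs on a sparser set. The gate, its weights, and the hard top-k selection are assumptions for illustration, not the paper's actual network:

```python
import numpy as np

def dynamic_ball_query(points, feats, gate_w, budget):
    # Score each input point from its features with a (hypothetical)
    # linear gate, then retain only the top-`budget` points.
    scores = feats @ gate_w              # per-point relevance score
    keep = np.argsort(scores)[-budget:]  # indices of retained points
    return points[keep], feats[keep]
```

In practice such gating is trained end-to-end (e.g. with a differentiable relaxation of the hard selection), which is how the computational savings come without a separate selection stage.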
arXiv Detail & Related papers (2022-07-22T07:08:42Z)
- Exploring Diversity-based Active Learning for 3D Object Detection in Autonomous Driving [45.405303803618]
We investigate diversity-based active learning (AL) as a potential solution to alleviate the annotation burden.
We propose a novel acquisition function that enforces spatial and temporal diversity in the selected samples.
We demonstrate the effectiveness of the proposed method on the nuScenes dataset and show that it outperforms existing AL strategies significantly.
arXiv Detail & Related papers (2022-05-16T14:21:30Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method outperforms state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.