Leveraging Transformer Decoder for Automotive Radar Object Detection
- URL: http://arxiv.org/abs/2601.13386v1
- Date: Mon, 19 Jan 2026 20:44:24 GMT
- Title: Leveraging Transformer Decoder for Automotive Radar Object Detection
- Authors: Changxu Zhang, Zhaoze Wang, Tai Fei, Christopher Grimm, Yi Jin, Claas Tebruegge, Ernst Warsitz, Markus Gardill,
- Abstract summary: We present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder. Pyramid Token Fusion (PTF) converts a feature pyramid into a unified, scale-aware token sequence. We evaluate the proposed framework on the RADDet dataset, where it achieves significant improvements over state-of-the-art radar-only baselines.
- Score: 9.764772760421792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatial-temporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet dataset, where it achieves significant improvements over state-of-the-art radar-only baselines.
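The abstract's core data-flow step is PTF: flattening a multi-scale feature pyramid into one token sequence, with each token tagged by the scale and position it came from so a scale-aware positional encoding can be attached before the decoder attends over it. The paper gives no code, so the sketch below is a minimal, hypothetical illustration of that flattening in plain Python (the function name `pyramid_token_fusion` and the list-based feature representation are assumptions; the actual PTF module also applies learned projections to unify channel dimensions):

```python
from typing import List, Tuple

# A pyramid level is an H x W grid of C-dimensional feature vectors,
# represented here as nested lists for a dependency-free sketch.
Level = List[List[List[float]]]
Token = Tuple[List[float], Tuple[int, int, int]]  # (feature, (level, row, col))

def pyramid_token_fusion(pyramid: List[Level]) -> List[Token]:
    """Flatten a feature pyramid into a unified, scale-aware token sequence.

    Each token pairs a feature vector with its (level, row, col) index,
    from which a scale-aware positional encoding would be derived before
    the tokens are fed to the Transformer decoder's cross-attention.
    """
    tokens: List[Token] = []
    for level, fmap in enumerate(pyramid):
        for r, row in enumerate(fmap):
            for c, feat in enumerate(row):
                tokens.append((feat, (level, r, c)))
    return tokens

# Toy two-level pyramid: a 4x4 map and a 2x2 map, both with C = 2.
pyramid = [
    [[[0.0, 0.0] for _ in range(4)] for _ in range(4)],
    [[[1.0, 1.0] for _ in range(2)] for _ in range(2)],
]
tokens = pyramid_token_fusion(pyramid)
print(len(tokens))  # 16 + 4 = 20 tokens, one per spatial cell across scales
```

The resulting sequence length is the sum of H×W over all pyramid levels; the decoder's learnable object queries then attend over this single sequence rather than over each scale separately, which is what lets one set-prediction head replace per-scale dense proposals.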
Related papers
- TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder [66.22997415145467]
This paper presents a joint completion and detection framework that improves the detection features in sparse areas. Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks. The results show that our framework consistently improves end-to-end 3D object detection, with mean average precision (mAP) gains ranging from 0.7 to 1.5 across multiple methods.
arXiv Detail & Related papers (2025-12-12T00:08:03Z)
- PAN: Pillars-Attention-Based Network for 3D Object Detection [3.3274570204477922]
This work presents a novel 3D object detection algorithm using cameras and radars in the bird's-eye view (BEV). Our algorithm exploits the advantages of radar before fusing the features into a detection head. A new backbone is introduced, which maps the radar pillar features into an embedded dimension.
arXiv Detail & Related papers (2025-09-19T12:40:49Z)
- Mask-RadarNet: Enhancing Transformer With Spatial-Temporal Semantic Context for Radar Object Detection in Autonomous Driving [11.221694136475554]
We propose a model called Mask-RadarNet to fully utilize the hierarchical semantic features from the input radar data. Mask-RadarNet exploits the combination of interleaved convolution and attention operations to replace the traditional architecture in transformer-based models. With relatively lower computational complexity and fewer parameters, the proposed Mask-RadarNet achieves higher recognition accuracy for object detection in autonomous driving.
arXiv Detail & Related papers (2024-12-20T06:39:40Z)
- RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection [16.37397687985041]
In radar-camera 3D object detection, the radar point clouds are sparse and noisy. We introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Our method achieves new state-of-the-art radar-camera 3D detection results.
arXiv Detail & Related papers (2024-12-17T11:02:36Z)
- RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion [58.77329237533034]
We propose a Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection. RaCFormer achieves superior results of 64.9% mAP and 70.2% on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-17T09:47:48Z)
- SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection [7.889379973011702]
We present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors.
arXiv Detail & Related papers (2024-11-29T17:17:38Z)
- ARFA: An Asymmetric Receptive Field Autoencoder Model for Spatiotemporal Prediction [55.30913411696375]
We propose an Asymmetric Receptive Field Autoencoder (ARFA) model, which introduces corresponding sizes of receptive field modules.
In the encoder, we present a large-kernel module for global temporal feature extraction. In the decoder, we develop a small-kernel module for local temporal reconstruction.
We construct the RainBench, a large-scale radar echo dataset for precipitation prediction, to address the scarcity of meteorological data in the domain.
arXiv Detail & Related papers (2023-09-01T07:55:53Z)
- Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
In this paper, we propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
- Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving [121.44554957537613]
We propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data.
Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames.
We achieve the state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
arXiv Detail & Related papers (2020-11-27T09:35:39Z)
- LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on the single-frame detection, while ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.