RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features
- URL: http://arxiv.org/abs/2508.15353v1
- Date: Thu, 21 Aug 2025 08:33:36 GMT
- Title: RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features
- Authors: Olga Matykina, Dmitry Yudin
- Abstract summary: Three-dimensional object detection is essential for autonomous driving and robotics. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
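The abstract does not spell out the fusion mechanism, so the sketch below is only a rough illustration of the general recipe: extract frozen DINOv2 patch tokens through the official torch.hub entry point, reshape them to a spatial grid, and merge them into a backbone feature map. The module name, the concatenation-plus-1x1-projection fusion, and all shapes are assumptions for illustration, not the paper's actual design; keeping the output channel count equal to the backbone's is one simple way to preserve compatibility with the baseline detection head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DinoFeatureFusion(nn.Module):
    """Hypothetical sketch: enrich a backbone feature map with DINOv2 patch
    features via concatenation and a 1x1 projection back to the backbone's
    channel count, so the downstream detection head is unchanged."""

    def __init__(self, backbone_channels: int, dino_channels: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(backbone_channels + dino_channels,
                              backbone_channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, dino_tokens: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) backbone map; dino_tokens: (B, N, D) patch tokens
        B, N, D = dino_tokens.shape
        s = int(N ** 0.5)                                  # assume a square patch grid
        dino_map = dino_tokens.transpose(1, 2).reshape(B, D, s, s)
        dino_map = F.interpolate(dino_map, size=feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.proj(torch.cat([feat, dino_map], dim=1))

# Frozen DINOv2 patch tokens from the official repo (ViT-B/14, 768-dim):
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
images = torch.randn(1, 3, 518, 518)                       # 518 px = 37 x 14-px patches
with torch.no_grad():
    tokens = dinov2.forward_features(images)["x_norm_patchtokens"]  # (1, 1369, 768)
fused = DinoFeatureFusion(backbone_channels=256)(torch.randn(1, 256, 32, 88), tokens)
```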
Related papers
- Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection
WRCFormer is a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube. WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios.
arXiv Detail & Related papers (2025-12-28T15:32:17Z)
- PAN: Pillars-Attention-Based Network for 3D Object Detection
This work presents a novel 3D object detection algorithm using cameras and radars in the bird's-eye view (BEV). Our algorithm exploits the advantages of radar before fusing the features into a detection head. A new backbone is introduced, which maps the radar pillar features into an embedded dimension.
arXiv Detail & Related papers (2025-09-19T12:40:49Z) - Multi-Representation Adapter with Neural Architecture Search for Efficient Range-Doppler Radar Object Detection [16.02875655103583]
This paper presents an efficient object detection model for Range-Doppler radar maps.<n>We first represent RD radar maps with multi-representation, i.e., heatmaps and grayscale images, to gather high-level object and fine-grained texture features.<n>We employ a One-Shot Neural Architecture Search method to further improve the model's efficiency while maintaining high performance.
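As a toy illustration of the multi-representation idea (the paper's actual preprocessing may differ), a single RD power map can be rendered both as a normalized grayscale image that preserves fine-grained texture and as a colormapped heatmap that emphasizes strong object responses; the function below assumes an OpenCV colormap and is purely a sketch.

```python
import cv2
import numpy as np

def rd_map_representations(rd_map: np.ndarray):
    """Render one Range-Doppler power map as two complementary images:
    a grayscale image (fine texture) and a JET heatmap (salient responses)."""
    db = 10.0 * np.log10(np.maximum(rd_map, 1e-12))          # power in dB
    norm = (db - db.min()) / (db.max() - db.min() + 1e-12)   # scale to [0, 1]
    gray = (norm * 255).astype(np.uint8)                     # (H, W) grayscale
    heat = cv2.applyColorMap(gray, cv2.COLORMAP_JET)         # (H, W, 3) heatmap
    return gray, heat
```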
arXiv Detail & Related papers (2025-09-01T09:06:53Z)
- RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection
Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity. We propose RobuRCDet, a robust object detection model in BEV.
arXiv Detail & Related papers (2025-02-18T17:17:38Z)
- SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection
We present SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with radar and camera point features. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors.
arXiv Detail & Related papers (2024-11-29T17:17:38Z)
- V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection
We present V2X-R, the first simulated V2X dataset incorporating LiDAR, camera, and 4D radar. V2X-R contains 12,079 scenarios with 37,727 frames of LiDAR and 4D radar point clouds, 150,908 images, and 170,859 annotated 3D vehicle bounding boxes. We propose a novel cooperative LiDAR-4D radar fusion pipeline for 3D object detection and implement it with various fusion strategies.
arXiv Detail & Related papers (2024-11-13T07:41:47Z)
- UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection
Many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information.
We propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process.
We also introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities.
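A minimal sketch of the general idea behind radar-aware depth prediction in a Lift-Splat-Shoot-style pipeline follows; the module name, the concatenation-based conditioning, and the bin count are illustrative assumptions and may differ from the paper's RDL design.

```python
import torch
import torch.nn as nn

class RadarConditionedDepth(nn.Module):
    """Hypothetical sketch: predict a per-pixel categorical depth
    distribution from image features concatenated with radar features
    that have been projected onto the image plane."""

    def __init__(self, img_channels: int, radar_channels: int, depth_bins: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(img_channels + radar_channels, img_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(img_channels, depth_bins, kernel_size=1),
        )

    def forward(self, img_feat: torch.Tensor, radar_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, Ci, H, W); radar_feat: (B, Cr, H, W), image-aligned
        logits = self.head(torch.cat([img_feat, radar_feat], dim=1))
        return logits.softmax(dim=1)  # depth distribution used for lifting to BEV
```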
arXiv Detail & Related papers (2024-09-23T06:57:27Z)
- RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network
We present a radar-camera fusion 3D object detection framework called RCBEVDet.
RadarBEVNet encodes sparse radar points into a dense bird's-eye-view feature.
Our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks.
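The generic operation underlying such encoders, scattering sparse radar points with per-point features into a dense BEV grid, can be sketched as below; the function and its grid parameters are illustrative assumptions, not RadarBEVNet itself.

```python
import torch

def radar_points_to_bev(points: torch.Tensor, feats: torch.Tensor,
                        grid=(128, 128), x_range=(-51.2, 51.2),
                        y_range=(-51.2, 51.2)) -> torch.Tensor:
    """Rasterize sparse radar points into a dense BEV feature map by
    averaging the features of all points falling into each cell.
    points: (N, 2+) metric x/y coordinates; feats: (N, C) per-point features."""
    H, W = grid
    xs = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * W).long().clamp(0, W - 1)
    ys = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * H).long().clamp(0, H - 1)
    idx = ys * W + xs                                  # flat cell index per point
    C = feats.shape[1]
    bev = torch.zeros(C, H, W)
    count = torch.zeros(1, H, W)
    bev.view(C, -1).index_add_(1, idx, feats.t())      # sum features per cell
    count.view(1, -1).index_add_(1, idx, torch.ones(1, points.shape[0]))
    return bev / count.clamp(min=1)                    # mean feature per occupied cell
```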
arXiv Detail & Related papers (2024-09-08T05:14:27Z)
- FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion
We introduce FusionViT, a novel vision-transformer-based 3D object detection model. FusionViT achieves state-of-the-art performance and outperforms existing baseline methods.
arXiv Detail & Related papers (2023-11-07T00:12:01Z)
- Reviewing 3D Object Detectors in the Context of High-Resolution 3+1D Radar
High-resolution imaging 4D (3+1D) radar sensors have boosted deep learning-based radar perception research.
We investigate deep learning-based models operating on radar point clouds for 3D object detection.
arXiv Detail & Related papers (2023-08-10T10:10:43Z)
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
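The core of direct feature querying can be illustrated as projecting each object query's 3D reference point into every camera view and bilinearly sampling image features there; the helper below is a generic projection-and-sample sketch under assumed shapes, not DETR4D's exact sparse-attention scheme.

```python
import torch
import torch.nn.functional as F

def sample_query_features(feats: torch.Tensor, ref_points: torch.Tensor,
                          lidar2img: torch.Tensor) -> torch.Tensor:
    """Project 3D query reference points into each camera and sample features.
    feats: (Ncam, C, H, W); ref_points: (Q, 3); lidar2img: (Ncam, 4, 4)."""
    Q = ref_points.shape[0]
    pts = torch.cat([ref_points, ref_points.new_ones(Q, 1)], dim=-1)   # homogeneous
    cam = torch.einsum("nij,qj->nqi", lidar2img, pts)                  # (Ncam, Q, 4)
    uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-5)                  # pixel coords
    H, W = feats.shape[-2:]
    grid = torch.stack([uv[..., 0] / W * 2 - 1,                        # normalize to [-1, 1]
                        uv[..., 1] / H * 2 - 1], dim=-1)
    sampled = F.grid_sample(feats, grid[:, :, None, :], align_corners=False)
    return sampled.squeeze(-1)                                         # (Ncam, C, Q)
```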
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
- A Simple Baseline for Multi-Camera 3D Object Detection
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers
We present M3DeTR, which combines different point cloud representations with different feature scales based on multi-scale feature pyramids.
M3DeTR is the first approach that unifies multiple point cloud representations and feature scales while simultaneously modeling mutual relationships between point clouds using transformers.
arXiv Detail & Related papers (2021-04-24T06:48:23Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
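As a toy sketch of neural message passing for data association (an illustrative assumption, not the paper's architecture), edge features between track and detection embeddings can be updated from their endpoint nodes and then scored as association logits:

```python
import torch
import torch.nn as nn

class EdgeMessagePassing(nn.Module):
    """One round of edge-centric message passing on the bipartite
    track-detection graph, followed by per-edge association scoring."""

    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.score = nn.Linear(dim, 1)

    def forward(self, tracks: torch.Tensor, dets: torch.Tensor,
                edges: torch.Tensor):
        # tracks: (T, D); dets: (N, D); edges: (T, N, D)
        t = tracks[:, None, :].expand(-1, dets.shape[0], -1)
        d = dets[None, :, :].expand(tracks.shape[0], -1, -1)
        edges = edges + self.edge_mlp(torch.cat([t, d, edges], dim=-1))
        return edges, self.score(edges).squeeze(-1)  # logits: (T, N)
```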
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.