Related papers: MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

URL: http://arxiv.org/abs/2509.18613v1
Date: Tue, 23 Sep 2025 04:02:28 GMT
Title: MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving
Authors: Yuzhi Wu, Li Xiao, Jun Liu, Guangfeng Jiang, XiangGen Xia,
Abstract summary: MLF-4DRCNet is a novel framework for 3D object detection via multi-level fusion of 4D radar and camera images.<n>It incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation.<n>It attains performance comparable to LiDAR-based models on the View-of-Delft dataset.
Score: 31.26862558777292
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird's-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar's inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point-level, ERPE densities radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context and adopts pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves the state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.

Related papers

Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection [44.78575994732947]
WRCFormer is a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube.<n>WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios.
arXiv Detail & Related papers (2025-12-28T15:32:17Z)
M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar [12.877894178462297]
M3Detection is a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D radar.<n>We show that M3Detection achieves state-of-the-art 3D detection performance, its effectiveness in multi-frame detection with camera-4D imaging radar fusion.
arXiv Detail & Related papers (2025-10-31T04:34:15Z)
Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective [54.91271106816616]
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency.<n>We propose a Speed-Accuracy Tradeoff Network (SATNet) for Lightweight RGB-D SOD from three fundamental perspectives.<n> Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps.<n>For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities.<n>For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework.
arXiv Detail & Related papers (2025-05-07T19:37:20Z)
ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving [7.037019489455008]
We propose a 3D object detection method, termed ZFusion, which fuses 4D radar and vision modality.<n> FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales.<n>Experiments show that ZFusion achieved the state-of-the-art mAP in the region of interest.
arXiv Detail & Related papers (2025-04-04T13:29:32Z)
RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection [68.99784784185019]
Poor lighting or adverse weather conditions degrade camera performance.<n>Radar suffers from noise and positional ambiguity.<n>We propose RobuRCDet, a robust object detection model in BEV.
arXiv Detail & Related papers (2025-02-18T17:17:38Z)
MSSF: A 4D Radar and Camera Fusion Framework With Multi-Stage Sampling for 3D Object Detection in Autonomous Driving [9.184945917823047]
We present a simple but effective multi-stage sampling fusion (MSSF) network based on 4D radar and camera. MSSF achieves a 7.0% and 4.0% improvement in 3D mean average precision on the View-of-Delft (VoD) and TJ4DRadset datasets. It even surpasses classical LiDAR-based methods on the VoD dataset.
arXiv Detail & Related papers (2024-11-22T15:45:23Z)
V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection [64.93675471780209]
We present V2X-R, the first simulated V2X dataset incorporating LiDAR, camera, and 4D radar.<n>V2X-R contains 12,079 scenarios with 37,727 frames of LiDAR and 4D radar point clouds, 150,908 images, and 170,859 annotated 3D vehicle bounding boxes.<n>We propose a novel cooperative LiDAR-4D radar fusion pipeline for 3D object detection and implement it with various fusion strategies.
arXiv Detail & Related papers (2024-11-13T07:41:47Z)
4DRVO-Net: Deep 4D Radar-Visual Odometry Using Multi-Modal and Multi-Scale Adaptive Fusion [2.911052912709637]
Four-dimensional (4D) radar--visual odometry (4DRVO) integrates complementary information from 4D radar and cameras. 4DRVO may exhibit significant tracking errors owing to sparsity of 4D radar point clouds. We present 4DRVO-Net, which is a method for 4D radar--visual odometry.
arXiv Detail & Related papers (2023-08-12T14:00:09Z)
Bridging the View Disparity of Radar and Camera Features for Multi-modal Fusion 3D Object Detection [6.959556180268547]
This paper focuses on how to utilize millimeter-wave (MMW) radar and camera sensor fusion for 3D object detection. A novel method which realizes the feature-level fusion under bird-eye view (BEV) for a better feature representation is proposed.
arXiv Detail & Related papers (2022-08-25T13:21:37Z)
EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion(CB-Fusion) module. The proposed CB-Fusion module boosts the plentiful semantic information of point features with the image features in a cascade bi-directional interaction fusion manner. The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-21T10:48:34Z)
EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
Volumetric Propagation Network: Stereo-LiDAR Fusion for Long-Range Depth Estimation [81.08111209632501]
We propose a geometry-aware stereo-LiDAR fusion network for long-range depth estimation. We exploit sparse and accurate point clouds as a cue for guiding correspondences of stereo images in a unified 3D volume space. Our network achieves state-of-the-art performance on the KITTI and the Virtual- KITTI datasets.
arXiv Detail & Related papers (2021-03-24T03:24:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.