MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies
- URL: http://arxiv.org/abs/2501.15384v2
- Date: Thu, 07 Aug 2025 10:39:28 GMT
- Title: MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies
- Authors: Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu
- Abstract summary: This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module. To reduce reliance on expensive point cloud annotations, we propose a pseudo-label generation pipeline based on an open-set segmentor.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.
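The abstract names the Radar Height Self-Attention module but not its internals. As a rough, hypothetical sketch of the idea — attending along the vertical axis of a radar voxel volume so sparse returns at different heights can exchange information — something like the following could serve (all class names, shapes, and hyperparameters are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class RadarHeightSelfAttention(nn.Module):
    """Sketch: self-attention along the vertical (Z) axis of a radar
    voxel feature volume, so sparse returns at different heights can
    exchange information. Names and shapes are illustrative only."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, C, Z, H, W) -> treat each BEV cell's height column
        # as a short sequence of Z tokens.
        b, c, z, h, w = voxels.shape
        tokens = voxels.permute(0, 3, 4, 2, 1).reshape(b * h * w, z, c)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual + norm
        return tokens.reshape(b, h, w, z, c).permute(0, 4, 3, 1, 2)

# Tiny smoke test with a dummy radar volume.
if __name__ == "__main__":
    rhsa = RadarHeightSelfAttention(channels=32)
    out = rhsa(torch.randn(2, 32, 8, 16, 16))
    print(out.shape)  # torch.Size([2, 32, 8, 16, 16])
```

Attending only over the short Z axis keeps the cost linear in the number of BEV cells, which is one reason a height-wise split is attractive for sparse radar volumes.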
Related papers
- NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments [56.35569661650558]
We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation. Rather than constructing a global map, NOVA formulates perception, estimation, and control entirely in the target's reference frame. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss.
arXiv Detail & Related papers (2025-06-23T14:28:30Z)
- Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective [54.91271106816616]
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. We propose a Speed-Accuracy Tradeoff Network (SATNet) for Lightweight RGB-D SOD from three fundamental perspectives. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework.
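The DAM is only named here; one plausible reading of "consistency within and between modalities" is per-modality self-attention followed by cross-modal attention. A minimal sketch under that assumption (hypothetical, not SATNet's code):

```python
import torch
import torch.nn as nn

class DecoupledAttention(nn.Module):
    """Sketch of a decoupled attention block: self-attention inside each
    modality, then cross-attention between RGB and depth tokens.
    Purely illustrative; names and shapes are assumptions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, dep):
        # Intra-modality consistency: each stream attends to itself.
        rgb = rgb + self.self_rgb(rgb, rgb, rgb)[0]
        dep = dep + self.self_dep(dep, dep, dep)[0]
        # Inter-modality consistency: RGB queries attend to depth tokens.
        return rgb + self.cross(rgb, dep, dep)[0]

tokens_rgb = torch.randn(2, 196, 64)  # (batch, tokens, dim)
tokens_dep = torch.randn(2, 196, 64)
print(DecoupledAttention(64)(tokens_rgb, tokens_dep).shape)  # (2, 196, 64)
```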
arXiv Detail & Related papers (2025-05-07T19:37:20Z)
- ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving [7.037019489455008]
We propose a 3D object detection method, termed ZFusion, which fuses the 4D radar and vision modalities.
The FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales.
Experiments show that ZFusion achieves state-of-the-art mAP in the region of interest.
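The FP-DDCA internals aren't described in this summary; a generic way to "interactively fuse multi-modal features at different scales" is one cross-attention block per pyramid level, as sketched below (assumed design, illustrative names):

```python
import torch
import torch.nn as nn

class MultiScaleCrossFusion(nn.Module):
    """Sketch: one cross-attention block per feature-pyramid scale,
    letting camera features query radar features at matching resolution.
    Illustrative only; not the FP-DDCA implementation."""

    def __init__(self, dims=(64, 128, 256), heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True) for d in dims
        )

    def forward(self, cam_feats, radar_feats):
        # cam_feats / radar_feats: lists of (B, N_i, C_i) token tensors,
        # one entry per scale, coarsest to finest.
        fused = []
        for blk, cam, rad in zip(self.blocks, cam_feats, radar_feats):
            fused.append(cam + blk(cam, rad, rad)[0])
        return fused

cams = [torch.randn(2, 400, 64), torch.randn(2, 100, 128), torch.randn(2, 25, 256)]
rads = [torch.randn(2, 400, 64), torch.randn(2, 100, 128), torch.randn(2, 25, 256)]
print([f.shape for f in MultiScaleCrossFusion()(cams, rads)])
```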
arXiv Detail & Related papers (2025-04-04T13:29:32Z)
- MinkOcc: Towards real-time label-efficient semantic occupancy prediction [8.239334282982623]
MinkOcc is a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs.
It reduces reliance on manual labeling by 90% while maintaining competitive accuracy.
We aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
arXiv Detail & Related papers (2025-04-03T04:31:56Z)
- SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
- L2RDaS: Synthesizing 4D Radar Tensors for Model Generalization via Dataset Expansion [6.605694475813286]
We propose LiDAR-to-4D radar data synthesis (L2RDaS), a framework that synthesizes spatially informative 4D radar tensors from LiDAR data available in autonomous driving datasets. L2RDaS integrates a modified U-Net architecture to effectively capture spatial information and an object information supplement (OBIS) module to enhance reflection fidelity. L2RDaS improves model generalization by expanding real datasets with synthetic radar tensors, achieving an average increase of 4.25% in $AP_{BEV}$ and 2.87% in $AP_{3D}$.
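The "modified U-Net" is unspecified beyond the name; for orientation, a minimal two-level U-Net that maps a BEV LiDAR raster to a radar-like tensor might look as follows (hypothetical channels and resolution, not the L2RDaS network):

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """Minimal 2-level U-Net sketch mapping a BEV LiDAR raster to a
    radar-like tensor. Illustrative only."""

    def __init__(self, cin=1, cout=1, w=32):
        super().__init__()
        self.enc1, self.enc2 = block(cin, w), block(w, 2 * w)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(2 * w, w, 2, stride=2)
        self.dec = block(2 * w, w)           # takes skip-concat input
        self.head = nn.Conv2d(w, cout, 1)

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        d = self.up(e2)                      # back to full resolution
        d = self.dec(torch.cat([d, e1], 1))  # skip connection
        return self.head(d)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # (1, 1, 64, 64)
```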
arXiv Detail & Related papers (2025-03-05T16:16:46Z)
- RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection [68.99784784185019]
Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity.
We propose RobuRCDet, a robust radar-camera 3D object detection model in BEV.
arXiv Detail & Related papers (2025-02-18T17:17:38Z)
- Bayesian Approximation-Based Trajectory Prediction and Tracking with 4D Radar [13.438311878715536]
3D multi-object tracking (MOT) is vital for autonomous vehicles, yet LiDAR- and camera-based methods degrade in adverse weather.
We propose Bayes-4DRTrack, a 4D Radar-based MOT framework that adopts a transformer-based motion prediction network.
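In the tracking literature, "Bayesian approximation" is often realized with Monte Carlo dropout: dropout stays active at test time, and the spread over repeated forward passes serves as motion uncertainty. A sketch under that assumption (not necessarily Bayes-4DRTrack's exact formulation):

```python
import torch
import torch.nn as nn

class MotionHead(nn.Module):
    """Toy motion-prediction head with dropout, used here only to
    illustrate Monte Carlo dropout as a Bayesian approximation."""

    def __init__(self, dim=64, horizon=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(dim, horizon * 2))  # (x, y) per future step

def mc_dropout_predict(head: MotionHead, feat: torch.Tensor, n: int = 20):
    """Run n stochastic passes with dropout enabled; return the mean
    trajectory and the per-step standard deviation as uncertainty."""
    head.train()  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([head.net(feat) for _ in range(n)])
    return samples.mean(0), samples.std(0)

mean, std = mc_dropout_predict(MotionHead(), torch.randn(4, 64))
print(mean.shape, std.shape)  # torch.Size([4, 12]) each
```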
arXiv Detail & Related papers (2025-02-03T13:49:21Z)
- Doracamom: Joint 3D Detection and Occupancy Prediction with Multi-view 4D Radars and Cameras for Omnidirectional Perception [9.76463525667238]
We propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction. Code and models will be publicly available.
arXiv Detail & Related papers (2025-01-26T04:24:07Z)
- MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation [8.113965240054506]
We propose MR-Occ, a novel approach for camera-LiDAR fusion-based 3D semantic occupancy prediction. HVFR improves performance by enhancing features for critical voxels, reducing computational cost. MOD introduces an 'occluded' class to better handle regions obscured from sensor view, improving accuracy. PVF-Net leverages densified LiDAR features to effectively fuse camera and LiDAR data through a deformable attention mechanism.
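Deformable attention gathers features at a few learned offsets around each query instead of attending densely. A single-head, single-level sketch of that sampling pattern (illustrative only, not PVF-Net):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampler(nn.Module):
    """Single-head sketch of deformable-attention sampling: each query
    predicts K offsets and a softmax weight per offset, then gathers
    bilinearly interpolated features at those points."""

    def __init__(self, dim=64, k=4):
        super().__init__()
        self.k = k
        self.offsets = nn.Linear(dim, k * 2)   # (dx, dy) per sample point
        self.weights = nn.Linear(dim, k)

    def forward(self, query, feat, ref):
        # query: (B, N, C); feat: (B, C, H, W); ref: (B, N, 2) in [-1, 1]
        b, n, _ = query.shape
        off = self.offsets(query).view(b, n, self.k, 2).tanh() * 0.1
        w = self.weights(query).softmax(-1)              # (B, N, K)
        grid = (ref.unsqueeze(2) + off).clamp(-1, 1)     # (B, N, K, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)  # (B, C, N, K)
        return (sampled * w.unsqueeze(1)).sum(-1).transpose(1, 2)  # (B, N, C)

q, f = torch.randn(2, 10, 64), torch.randn(2, 64, 32, 32)
ref = torch.rand(2, 10, 2) * 2 - 1
print(DeformableSampler()(q, f, ref).shape)  # (2, 10, 64)
```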
arXiv Detail & Related papers (2024-12-29T14:39:21Z)
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods for 3D semantic occupancy prediction and flow estimation prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purely convolutional architecture, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
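Lifting-based occupancy methods generally build on the lift-splat idea: predict a categorical depth distribution per pixel and outer-product it with the image features to fill a camera frustum. A minimal sketch of that base operation (assumed, not ALOcc's adaptive variant):

```python
import torch
import torch.nn as nn

class DepthLift(nn.Module):
    """Sketch of lift-splat-style 2D-to-3D lifting: predict a categorical
    depth distribution per pixel and take the outer product with the
    image features to populate a camera frustum."""

    def __init__(self, cin=64, depth_bins=32):
        super().__init__()
        self.depth = nn.Conv2d(cin, depth_bins, 1)  # per-pixel depth logits
        self.feat = nn.Conv2d(cin, cin, 1)

    def forward(self, x):
        # x: (B, C, H, W) image features.
        d = self.depth(x).softmax(1)            # (B, D, H, W)
        f = self.feat(x)                        # (B, C, H, W)
        # Outer product: each depth bin gets the feature scaled by its prob.
        return d.unsqueeze(1) * f.unsqueeze(2)  # (B, C, D, H, W)

frustum = DepthLift()(torch.randn(2, 64, 16, 44))
print(frustum.shape)  # torch.Size([2, 64, 32, 16, 44])
```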
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
- ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera [53.20087549782785]
We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera.
Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions.
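A triplane stores three axis-aligned feature planes and answers a 3D query by projecting the point onto each plane and summing the bilinear lookups. A sketch of the lookup (hypothetical shapes, not ET-Former's code):

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    """Sample triplane features at 3D points.
    planes: dict with 'xy', 'xz', 'yz' tensors of shape (1, C, R, R).
    pts: (N, 3) coordinates normalized to [-1, 1].
    Returns (N, C): the sum of the three bilinear lookups."""
    pairs = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}
    out = 0
    for name, (i, j) in pairs.items():
        grid = pts[:, (i, j)].view(1, -1, 1, 2)             # (1, N, 1, 2)
        smp = F.grid_sample(planes[name], grid,
                            align_corners=False)            # (1, C, N, 1)
        out = out + smp[0, :, :, 0].t()                     # (N, C)
    return out

planes = {k: torch.randn(1, 32, 64, 64) for k in ("xy", "xz", "yz")}
pts = torch.rand(100, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([100, 32])
```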
arXiv Detail & Related papers (2024-10-14T19:14:49Z)
- RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar [15.776076554141687]
The 3D occupancy-based perception pipeline has significantly advanced autonomous driving.
Current methods rely on LiDAR or camera inputs for 3D occupancy prediction.
We introduce a novel approach that utilizes 4D imaging radar sensors for 3D occupancy prediction.
arXiv Detail & Related papers (2024-05-22T21:48:17Z)
- OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
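Aligning sampling with the cameras' "infinite perceptive range" typically means contracting unbounded coordinates into a bounded volume, as in Mip-NeRF 360. One common contraction, sketched here (an assumption; OccNeRF's exact parameterization may differ):

```python
import torch

def contract(x: torch.Tensor) -> torch.Tensor:
    """Map unbounded 3D points into a bounded ball of radius 2.
    Points with |x| <= 1 are unchanged; farther points are squashed
    so infinity lands on the radius-2 sphere (Mip-NeRF 360 style)."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    squashed = (2.0 - 1.0 / norm) * (x / norm)
    return torch.where(norm <= 1.0, x, squashed)

pts = torch.tensor([[0.5, 0.0, 0.0], [100.0, 0.0, 0.0]])
print(contract(pts))  # [[0.5, 0, 0], [1.99, 0, 0]]
```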
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
- 4DRVO-Net: Deep 4D Radar-Visual Odometry Using Multi-Modal and Multi-Scale Adaptive Fusion [2.911052912709637]
Four-dimensional (4D) radar-visual odometry (4DRVO) integrates complementary information from 4D radar and cameras.
4DRVO may exhibit significant tracking errors owing to the sparsity of 4D radar point clouds.
We present 4DRVO-Net, a method for 4D radar-visual odometry.
arXiv Detail & Related papers (2023-08-12T14:00:09Z)
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
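"Direct feature query" amounts to projecting each query's 3D reference point into every camera and bilinearly sampling image features there. A minimal sketch with assumed pinhole projection matrices and shapes:

```python
import torch
import torch.nn.functional as F

def sample_multiview(feats, ref_pts, projections, img_size):
    """Project 3D reference points into each view and sample features.
    feats: (V, C, H, W) per-view feature maps.
    ref_pts: (N, 3) points in the ego frame.
    projections: (V, 3, 4) camera projection matrices.
    img_size: (height, width) of the source images in pixels.
    Returns (N, V, C) sampled features (zeroed behind a camera)."""
    v = feats.shape[0]
    homo = torch.cat([ref_pts, torch.ones(len(ref_pts), 1)], -1)  # (N, 4)
    cam = torch.einsum("vij,nj->vni", projections, homo)          # (V, N, 3)
    depth = cam[..., 2:3].clamp_min(1e-5)
    uv = cam[..., :2] / depth                                     # pixel coords
    # Normalize pixel coords to [-1, 1] for grid_sample.
    wh = torch.tensor([img_size[1], img_size[0]], dtype=torch.float)
    grid = (uv / wh * 2 - 1).view(v, -1, 1, 2)
    smp = F.grid_sample(feats, grid, align_corners=False)         # (V, C, N, 1)
    valid = (cam[..., 2:3] > 0).view(v, 1, -1, 1)                 # in front
    return (smp * valid).squeeze(-1).permute(2, 0, 1)             # (N, V, C)

feats = torch.randn(6, 64, 32, 88)
proj = torch.randn(6, 3, 4)
print(sample_multiview(feats, torch.randn(50, 3), proj, (256, 704)).shape)
```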
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
Our method efficiently makes use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion (CB-Fusion) module.
The proposed CB-Fusion module enriches point features with plentiful semantic information from image features in a cascade bi-directional interaction manner.
The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-21T10:48:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.