Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking
- URL: http://arxiv.org/abs/2510.10287v1
- Date: Sat, 11 Oct 2025 17:01:42 GMT
- Title: Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking
- Authors: Markus Käppeler, Özgün Çiçek, Daniele Cattaneo, Claudius Gläser, Yakov Miron, Abhinav Valada
- Abstract summary: Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features. We propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features.
- Score: 16.90910171943142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to leverage their complementary strengths. Our approach introduces BEV maps guided by foundation models, leveraging descriptive DINOv2 features that are distilled into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric features from DINOv2, our model leverages this hybrid representation via deformable aggregation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de .
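The abstract does not spell out the distillation objective, but its core step, aligning the model's BEV features with DINOv2 features lifted into BEV, can be illustrated with a short sketch. Everything below (tensor shapes, the cosine objective, the valid-mask bookkeeping, and the function name bev_distillation_loss) is an assumption made for illustration, not the released DualViewDistill code.

```python
import torch
import torch.nn.functional as F

def bev_distillation_loss(student_bev: torch.Tensor,
                          teacher_bev: torch.Tensor,
                          valid_mask: torch.Tensor) -> torch.Tensor:
    """student_bev, teacher_bev: (B, C, H, W) BEV feature maps; teacher_bev would be obtained
    by splatting DINOv2 image features into BEV cells with camera geometry (not shown here).
    valid_mask: (B, 1, H, W), 1 where a BEV cell received at least one projected feature."""
    student = F.normalize(student_bev, dim=1)
    teacher = F.normalize(teacher_bev, dim=1)
    cos = (student * teacher).sum(dim=1, keepdim=True)  # per-cell cosine similarity
    # average (1 - cosine) over observed BEV cells only
    return ((1.0 - cos) * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
```

A cosine-style objective is a common choice when distilling high-dimensional foundation-model features, since it matches feature direction rather than magnitude; the paper may use a different loss.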
Related papers
- DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection [3.526990431236137]
Multi-view camera-only 3D object detection largely follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features. We propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception.
arXiv Detail & Related papers (2024-05-17T07:04:29Z)
- Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps [13.524499163234342]
We propose a new model capable of performing zero-shot projections of any modality available in a first person view to the corresponding BEV map.
We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
arXiv Detail & Related papers (2024-02-21T14:50:24Z)
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space.
Unsupervised domain adaptive BEV, which learns effectively from diverse unlabelled target data, is far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
- Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning [93.71280187657831]
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
- SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection [46.92706423094971]
We propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features.
We also propose BEV-Paste, an effective data augmentation strategy that closely matches the semantic-aware BEV features.
Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance.
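As a rough illustration of the filtering step mentioned above (suppressing background pixels before image features are pooled into BEV), a hedged sketch follows; the threshold, shapes, and function name are illustrative assumptions rather than the official SA-BEV code.

```python
import torch

def semantic_aware_filter(img_feats: torch.Tensor,
                          fg_prob: torch.Tensor,
                          threshold: float = 0.35) -> torch.Tensor:
    """img_feats: (B, N, C, H, W) multi-view image features.
    fg_prob:   (B, N, 1, H, W) per-pixel foreground probability from a segmentation head.
    Background pixels are zeroed so they contribute nothing when features are lifted into BEV."""
    keep = (fg_prob > threshold).float()
    return img_feats * keep
```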
arXiv Detail & Related papers (2023-07-21T10:28:19Z)
- VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection [47.926010021559314]
Transformer-based detectors have demonstrated remarkable performance in 2D visual perception tasks.
However, their performance in multi-view 3D object detection remains inferior to the state-of-the-art (SOTA) of convolutional neural network based detectors.
We propose a novel BEV feature generation method, dual-view attention, which generates attention weights from both the BEV and camera view.
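The snippet above only states that attention weights come from both views; one plausible reading, sketched below entirely under our own assumptions (layer names, shapes, and the additive combination are not from the paper), is to add a logit predicted from the BEV query to a logit predicted from each sampled camera feature before the softmax.

```python
import torch
import torch.nn as nn

class DualViewAttentionSketch(nn.Module):
    """Hypothetical reading of 'dual-view attention': per-candidate attention logits are the sum
    of a term from the BEV query and a term from each sampled camera feature."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.bev_logits = nn.Linear(dim, num_points)  # one logit per candidate, from the BEV view
        self.cam_logits = nn.Linear(dim, 1)           # one logit per sampled camera feature

    def forward(self, bev_query: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        # bev_query: (B, Q, C) BEV query embeddings
        # cam_feats: (B, Q, P, C) camera features at P candidate image locations per query
        # (P must equal num_points in this sketch)
        logits = self.bev_logits(bev_query) + self.cam_logits(cam_feats).squeeze(-1)  # (B, Q, P)
        weights = logits.softmax(dim=-1)
        return (weights.unsqueeze(-1) * cam_feats).sum(dim=2)  # (B, Q, C) aggregated feature
```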
arXiv Detail & Related papers (2023-04-03T15:00:36Z)
- Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving [31.98600806479808]
Bird's-Eye-View (BEV) representations have significantly improved the performance of 3D detectors with camera inputs on popular benchmarks.
We evaluate the natural and adversarial robustness of various representative models under extensive settings.
We propose a 3D-consistent patch attack that applies adversarial patches in the temporal 3D space to guarantee consistency.
arXiv Detail & Related papers (2023-03-30T11:16:58Z)
- Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction [84.94140661523956]
We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes.
We model each point in the 3D space by summing its projected features on the three planes.
Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
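The sentence above describes the readout concretely enough to sketch: project a 3D point onto the three planes, sample each plane bilinearly, and sum the three features. The tensor layouts and the use of grid_sample below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tpv_point_features(points: torch.Tensor,
                       hw_plane: torch.Tensor,
                       dh_plane: torch.Tensor,
                       wd_plane: torch.Tensor) -> torch.Tensor:
    """points: (B, N, 3) coordinates normalized to [-1, 1] as (x, y, z).
    hw_plane / dh_plane / wd_plane: (B, C, *, *) the three orthogonal TPV feature planes."""
    x, y, z = points.unbind(-1)

    def sample(plane: torch.Tensor, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)          # (B, N, 1, 2) sampling locations
        feat = F.grid_sample(plane, grid, align_corners=False)   # (B, C, N, 1) bilinear samples
        return feat.squeeze(-1).transpose(1, 2)                  # (B, N, C)

    # Each plane observes the point through a different pair of axes; the three samples are summed.
    return sample(hw_plane, x, y) + sample(dh_plane, y, z) + sample(wd_plane, z, x)
```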
arXiv Detail & Related papers (2023-02-15T17:58:10Z)
- OA-DET3D: Embedding Object Awareness as a General Plug-in for Multi-Camera 3D Object Detection [77.43427778037203]
We introduce OA-DET3D, a plug-in module that improves 3D object detection. OA-DET3D boosts the representation of objects by leveraging object-centric depth information and foreground pseudo points. We conduct extensive experiments on the nuScenes dataset and Argoverse 2 dataset to validate the merits of OA-DET3D.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)