Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2
- URL: http://arxiv.org/abs/2501.08118v1
- Date: Tue, 14 Jan 2025 13:51:14 GMT
- Title: Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2
- Authors: Seamie Hayes, Ganesh Sistu, Ciarán Eising
- Abstract summary: We introduce an innovative application of Metric3Dv2's depth information as a PseudoLiDAR point cloud incorporated into the Simple-BEV architecture. This integration results in a +3 IoU improvement compared to the Camera-only model.
- Score: 6.42131197643513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Birds Eye View perception models require extensive data to perform and generalize effectively. While traditional datasets often provide abundant driving scenes from diverse locations, this is not always the case. It is crucial to maximize the utility of the available training data. With the advent of large foundation models such as DINOv2 and Metric3Dv2, a pertinent question arises: can these models be integrated into existing model architectures to not only reduce the required training data but also surpass the performance of current models? We choose two model architectures in the vehicle segmentation domain to alter: Lift-Splat-Shoot, and Simple-BEV. For Lift-Splat-Shoot, we explore the implementation of frozen DINOv2 for feature extraction and Metric3Dv2 for depth estimation, where we greatly exceed the baseline results by 7.4 IoU while utilizing only half the training data and iterations. Furthermore, we introduce an innovative application of Metric3Dv2's depth information as a PseudoLiDAR point cloud incorporated into the Simple-BEV architecture, replacing traditional LiDAR. This integration results in a +3 IoU improvement compared to the Camera-only model.
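As an aside on the pseudo-LiDAR step: a metric depth map can be unprojected into a camera-frame point cloud with the standard pinhole model before being voxelized into the BEV grid. A minimal sketch, assuming illustrative intrinsics and image dimensions (not values from the paper):

```python
# Unproject an HxW metric depth map (e.g. from Metric3Dv2) into a pseudo-LiDAR
# point cloud. K is an illustrative pinhole intrinsics matrix, not the paper's.
import numpy as np

def depth_to_pseudolidar(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # X = (u - cx) * Z / fx
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]      # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, z], axis=1)                # (H*W, 3), camera frame
    return pts[z > 0]                                # drop invalid depths

K = np.array([[1266.4, 0.0, 816.3],
              [0.0, 1266.4, 491.5],
              [0.0, 0.0, 1.0]])
depth = np.random.uniform(1.0, 60.0, (900, 1600)).astype(np.float32)
cloud = depth_to_pseudolidar(depth, K)               # points to splat into BEV
print(cloud.shape)
```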
Related papers
- Enhancing Steering Estimation with Semantic-Aware GNNs [41.89219383258699]
Hybrid architectures combine 3D neural network models with recurrent neural networks (RNNs) for temporal modeling.
We evaluate four hybrid 3D models, all of which outperform the 2D-only baseline.
We validate our approach on the KITTI dataset, achieving a 71% improvement over 2D-only models.
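A hedged sketch of the hybrid pattern this summary describes, pairing a 3D-CNN clip encoder with a GRU before a steering head; all layer sizes are illustrative assumptions, not the paper's architecture:

```python
# Toy hybrid 3D-CNN + RNN: encode a short clip with 3D convolutions, model the
# remaining temporal steps with a GRU, and regress a steering angle.
import torch
import torch.nn as nn

class Hybrid3DRNN(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 1, 1)),         # keep 4 temporal steps
        )
        self.rnn = nn.GRU(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)             # steering angle

    def forward(self, clip):                         # clip: (B, 3, T, H, W)
        f = self.encoder(clip).flatten(2)            # (B, 32, 4)
        out, _ = self.rnn(f.transpose(1, 2))         # (B, 4, hidden)
        return self.head(out[:, -1])                 # predict from the last step

model = Hybrid3DRNN()
print(model(torch.randn(2, 3, 8, 64, 64)).shape)     # torch.Size([2, 1])
```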
arXiv Detail & Related papers (2025-03-21T13:58:08Z) - Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters [32.21090169762889]
BALViT is a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders.
We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
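The frozen-encoder pattern can be sketched as follows; the torchvision ViT stands in for the frozen vision model, and the small linear adapter is a placeholder for BALViT's 2D-3D adapters and range-image projection:

```python
# Freeze a pretrained 2D backbone and train only a lightweight adapter on top
# of LiDAR data projected to a 2D (range) image. The backbone is a stand-in.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

backbone = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False            # frozen foundation features
backbone.eval()

adapter = nn.Sequential(               # the only trainable parameters
    nn.Linear(1000, 256), nn.ReLU(),
    nn.Linear(256, 20),                # e.g. 20 semantic classes
)

range_image = torch.randn(2, 3, 224, 224)   # LiDAR projected to a 2D image
with torch.no_grad():
    feats = backbone(range_image)            # (2, 1000) frozen features
logits = adapter(feats)                      # gradients flow only here
print(logits.shape)                          # torch.Size([2, 20])
```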
arXiv Detail & Related papers (2025-03-05T09:30:49Z) - YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary [12.39040757106137]
We introduce an innovative Retriever-Dictionary (RD) module to address this issue.
This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset.
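One way to read "retrieving features from a Dictionary" is soft attention over a learned set of atoms fused back into the feature map; a minimal, hypothetical sketch (not YOLO-RD's exact module):

```python
# Features act as queries over a learned dictionary; the attention-weighted
# mixture of atoms is fused residually. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrieverDictionary(nn.Module):
    def __init__(self, dim: int = 64, atoms: int = 128):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(atoms, dim))  # learned atoms
        self.query = nn.Linear(dim, dim)

    def forward(self, feats):                    # feats: (B, N, dim)
        q = self.query(feats)                    # project features to queries
        attn = F.softmax(q @ self.dictionary.T / feats.shape[-1] ** 0.5, dim=-1)
        retrieved = attn @ self.dictionary       # (B, N, dim) retrieved mixture
        return feats + retrieved                 # residual fusion

rd = RetrieverDictionary()
fused = rd(torch.randn(2, 400, 64))              # e.g. 20x20 feature-map tokens
print(fused.shape)                               # torch.Size([2, 400, 64])
```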
arXiv Detail & Related papers (2024-10-20T09:38:58Z) - POINTS: Improving Your Vision-language Model with Affordable Strategies [28.611705477757454]
We train a robust baseline model using the latest advancements in vision-language models.
We filter pre-training data using perplexity, selecting the lowest perplexity data for training.
During visual instruction tuning, we use model soup on different datasets when adding more datasets yields only marginal improvements.
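The model-soup step amounts to parameter-wise averaging of checkpoints fine-tuned from the same base model; a minimal sketch, with small toy modules standing in for the checkpoints:

```python
# "Model soup": uniformly average compatible state_dicts parameter-by-parameter.
import torch
import torch.nn as nn

def model_soup(state_dicts):
    soup = {}
    for key in state_dicts[0]:
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return soup

# three copies of one architecture, standing in for fine-tuned checkpoints
models = [nn.Linear(8, 2) for _ in range(3)]
souped = nn.Linear(8, 2)
souped.load_state_dict(model_soup([m.state_dict() for m in models]))
print(souped.weight.shape)  # torch.Size([2, 8])
```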
arXiv Detail & Related papers (2024-09-07T13:41:37Z) - DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while significantly reducing the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
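The voting fusion reduces to a per-point majority vote over the labels contributed by each 2D model after projection onto the cloud; a minimal sketch with random stand-in predictions:

```python
# Fuse per-point labels from several 2D models by majority vote to form
# 3D pseudo labels. Model/point/class counts are illustrative.
import numpy as np

votes = np.random.randint(0, 13, size=(4, 10000))    # (num_models, num_points)

def fuse_by_voting(votes: np.ndarray) -> np.ndarray:
    num_classes = votes.max() + 1
    counts = np.apply_along_axis(                    # per-point class histogram
        lambda col: np.bincount(col, minlength=num_classes), 0, votes)
    return counts.argmax(axis=0)                     # (num_points,) pseudo labels

pseudo_labels = fuse_by_voting(votes)
print(pseudo_labels.shape)                           # (10000,)
```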
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models [1.1965844936801797]
Generative modeling of 3D LiDAR data is an emerging task with promising applications for autonomous mobile robots.
We present R2DM, a novel generative model for LiDAR data that can generate diverse and high-fidelity 3D scene point clouds.
Our method is built upon denoising diffusion probabilistic models (DDPMs), which have shown impressive results among generative model frameworks.
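A hedged sketch of one DDPM training step in this setting, with a toy CNN standing in for the denoiser and timestep conditioning omitted for brevity (R2DM's actual network and schedule differ):

```python
# One DDPM step: sample a timestep, noise the input per the closed-form
# forward process, and regress the added noise with an MSE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative alphas

denoiser = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # stands in for a U-Net
x0 = torch.randn(8, 1, 64, 256)                        # toy LiDAR range images

t = torch.randint(0, T, (8,))
ab = alpha_bar[t].view(-1, 1, 1, 1)
noise = torch.randn_like(x0)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise         # q(x_t | x_0)
loss = F.mse_loss(denoiser(x_t), noise)                # predict the added noise
loss.backward()
print(loss.item())
```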
arXiv Detail & Related papers (2023-09-17T12:26:57Z) - Interpretable 2D Vision Models for 3D Medical Images [47.75089895500738]
This study proposes a simple approach of adapting 2D networks with an intermediate feature representation for processing 3D images.
On all 3D MedMNIST benchmark datasets and on two real-world datasets comprising several hundred high-resolution CT or MRI scans, we show that our approach performs on par with existing methods.
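A minimal sketch of the slice-wise adaptation idea: a 2D encoder runs per slice and an intermediate feature representation is aggregated across depth (the encoder and mean pooling here are illustrative assumptions, not the paper's design):

```python
# Run a 2D encoder over each slice of a 3D volume, then aggregate the
# per-slice features for a volume-level prediction.
import torch
import torch.nn as nn

encoder2d = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # per-slice feature vector
)
head = nn.Linear(16, 2)                               # volume-level prediction

volume = torch.randn(4, 1, 28, 28, 28)                # (B, C, D, H, W)
B, C, D, H, W = volume.shape
slices = volume.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W)
feats = encoder2d(slices).view(B, D, -1)              # (B, D, 16)
logits = head(feats.mean(dim=1))                      # aggregate over slices
print(logits.shape)                                   # torch.Size([4, 2])
```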
arXiv Detail & Related papers (2023-07-13T08:27:09Z) - Knowledge Distillation from 3D to Bird's-Eye-View for LiDAR Semantic Segmentation [6.326177388323946]
We develop an effective 3D-to-BEV knowledge distillation method that transfers rich knowledge from 3D voxel-based models to BEV-based models.
Our framework mainly consists of two modules: the voxel-to-pillar distillation module and the label-weight distillation module.
Label-weight distillation helps the model pay more attention to regions with more height information.
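A rough sketch of the two ideas: collapse the teacher's voxel features along height into pillars and match the student's BEV features, with a per-cell weight as a crude stand-in for label-weight distillation (all shapes and the max-pooling choice are assumptions, not the paper's modules):

```python
# Voxel-to-pillar distillation loss: squash teacher voxels over height and
# compare to the student BEV map, weighted per BEV cell.
import torch

teacher_voxels = torch.randn(2, 32, 16, 128, 128)     # (B, C, Z, Y, X)
student_bev = torch.randn(2, 32, 128, 128)            # (B, C, Y, X)

teacher_pillars = teacher_voxels.max(dim=2).values    # height -> pillar features
weights = torch.rand(2, 1, 128, 128)                  # larger where height matters
loss = (weights * (student_bev - teacher_pillars) ** 2).mean()
print(loss.item())
```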
arXiv Detail & Related papers (2023-04-22T13:03:19Z) - Data-Free Adversarial Knowledge Distillation for Graph Neural Networks [62.71646916191515]
We propose the first end-to-end framework for data-free adversarial knowledge distillation on graph-structured data (DFAD-GNN).
Specifically, DFAD-GNN employs a generative adversarial network with three components: a pre-trained teacher model and a student model serve as two discriminators, while a generator derives training graphs used to distill knowledge from the teacher into the student.
Our DFAD-GNN significantly surpasses state-of-the-art data-free baselines in the graph classification task.
arXiv Detail & Related papers (2022-05-08T08:19:40Z) - Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
However, they perform poorly on real-world data, which is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z) - Model Reuse with Reduced Kernel Mean Embedding Specification [70.044322798187]
We present a two-phase framework for finding helpful models for a current application.
In the upload phase, when a model is uploaded into the pool, we construct a reduced kernel mean embedding (RKME) as a specification for the model.
Then, in the deployment phase, the relatedness between the current task and the pre-trained models is measured using their RKME specifications.
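The deployment-phase check can be sketched as an RBF-kernel MMD between the current task's empirical embedding and each model's reduced set (uniform weights assumed here; the actual RKME specification carries learned weights):

```python
# Relatedness as squared MMD in an RBF-kernel RKHS: lower = more related,
# so pick the pre-trained model whose reduced set gives the smallest value.
import numpy as np

def rbf(X, Y, gamma=0.5):
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def mmd2(X, Y):
    """Squared MMD between empirical kernel mean embeddings of X and Y."""
    return rbf(X, X).mean() + rbf(Y, Y).mean() - 2.0 * rbf(X, Y).mean()

task_data = np.random.randn(200, 5)                  # current task samples
spec_a = np.random.randn(10, 5)                      # model A's reduced set
spec_b = np.random.randn(10, 5) + 3.0                # model B's set (shifted)
print(mmd2(task_data, spec_a), mmd2(task_data, spec_b))
```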
arXiv Detail & Related papers (2020-01-20T15:15:07Z)