VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models
- URL: http://arxiv.org/abs/2408.13031v1
- Date: Fri, 23 Aug 2024 12:39:02 GMT
- Title: VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models
- Authors: Wentao Wu, Fanghua Hong, Xiao Wang, Chenglong Li, Jin Tang
- Abstract summary: We propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det.
Our model improves the baseline approach by $+5.1\%$ and $+6.2\%$ on the $AP_{0.5}$ and $AP_{0.75}$ metrics, respectively.
- Score: 21.186456742407007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images based on a pre-trained backbone (e.g., ResNet, ViT). Some researchers also enhance detection performance by exploiting pre-trained large foundation models. However, we argue that these detectors may achieve only sub-optimal results, because the large models they use are not specifically designed for vehicles. In addition, their results rely heavily on visual features, and they seldom consider the alignment between the vehicle's semantic information and its visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework, and the features of each proposal can be enhanced using VehicleMAE. More importantly, we propose a new VAtt2Vec module that predicts the vehicle semantic attributes of these proposals and transforms them into feature vectors to enhance the vision features via contrastive learning. Extensive experiments on three vehicle detection benchmark datasets thoroughly demonstrate the effectiveness of our vehicle detector. Specifically, our model improves the baseline approach by $+5.1\%$ and $+6.2\%$ on the $AP_{0.5}$ and $AP_{0.75}$ metrics, respectively, on the Cityscapes dataset. The source code of this work will be released at https://github.com/Event-AHU/VFM-Det.
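As a concrete illustration of the VAtt2Vec idea sketched in the abstract, the following PyTorch snippet aligns per-proposal visual features with attribute-derived vectors via a symmetric contrastive loss. This is a minimal sketch under assumed names and dimensions, not the authors' implementation; in particular, a trainable embedding table stands in for the frozen T5 text embeddings of attribute phrases.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAligner(nn.Module):
    """Aligns proposal visual features with attribute-derived vectors (illustrative)."""
    def __init__(self, feat_dim=256, num_attributes=32, tau=0.07):
        super().__init__()
        # Predict per-proposal attribute probabilities from RoI features.
        self.attr_head = nn.Linear(feat_dim, num_attributes)
        # Stand-in for frozen T5 embeddings of attribute phrases (hypothetical).
        self.attr_embed = nn.Embedding(num_attributes, feat_dim)
        self.tau = tau  # temperature for the contrastive loss

    def forward(self, proposal_feats):
        # proposal_feats: (N, feat_dim) features of N region proposals.
        attr_prob = self.attr_head(proposal_feats).sigmoid()   # (N, A)
        # Attribute vector = probability-weighted mixture of attribute embeddings.
        attr_vec = attr_prob @ self.attr_embed.weight          # (N, feat_dim)
        v = F.normalize(proposal_feats, dim=-1)
        a = F.normalize(attr_vec, dim=-1)
        logits = v @ a.t() / self.tau                          # (N, N) similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: each proposal should match its own attribute vector.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features:
loss = AttributeAligner()(torch.randn(8, 256))
loss.backward()
```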
Related papers
- Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models [47.18069715855738]
Recent vision foundation models can extract universal representations and show impressive abilities in various tasks.
We show that frozen foundation models can be versatile feature enhancers, even though they are not pre-trained for object detection (see the fusion sketch below).
arXiv Detail & Related papers (2024-10-25T15:38:24Z)
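Below is a minimal sketch of the "frozen foundation model as feature enhancer" idea from the Frozen-DETR entry above: detector features cross-attend into features produced by a frozen backbone. The class name, dimensions, and residual fusion are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

class FrozenEnhancer(nn.Module):
    """Cross-attends detector features into frozen foundation-model features."""
    def __init__(self, det_dim=256, vfm_dim=1024, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(vfm_dim, det_dim)  # align channel widths
        self.fuse = nn.MultiheadAttention(det_dim, num_heads, batch_first=True)

    def forward(self, det_tokens, vfm_tokens):
        # det_tokens: (B, N, det_dim) trainable detector features.
        # vfm_tokens: (B, M, vfm_dim) produced upstream by a frozen model
        # (i.e., computed under torch.no_grad()).
        mem = self.proj(vfm_tokens)
        enhanced, _ = self.fuse(det_tokens, mem, mem)  # queries attend to frozen memory
        return det_tokens + enhanced                   # residual fusion

# Usage with random stand-in tensors:
out = FrozenEnhancer()(torch.randn(2, 100, 256), torch.randn(2, 196, 1024))
```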
- Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z)
- Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception [36.92036421490819]
We propose a novel vehicle-centric pre-training framework called VehicleMAE.
We explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction.
A large-scale dataset, termed Autobot1M, is built to pre-train our model; it contains about 1M vehicle images and 12,693 pieces of text information (see the loss sketch below).
arXiv Detail & Related papers (2023-12-15T14:10:21Z)
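In the spirit of the structure-guided pre-training described in the VehicleMAE entry above, the sketch below combines masked pixel reconstruction with an auxiliary sketch/edge reconstruction term. The function name, tensor shapes, and loss weight are assumptions; the actual VehicleMAE objective may differ in detail.
```python
import torch

def structure_guided_mae_loss(pred_pix, pred_edge, target_pix, target_edge,
                              mask, w_edge=0.5):
    # pred_*/target_*: (B, L, patch_dim) per-patch predictions and targets;
    # mask: (B, L), 1 for masked patches. target_edge would come from an
    # edge/sketch extractor applied to the input image (not shown here).
    pix_loss = ((pred_pix - target_pix) ** 2).mean(-1)      # (B, L)
    edge_loss = ((pred_edge - target_edge) ** 2).mean(-1)   # (B, L)
    per_patch = pix_loss + w_edge * edge_loss
    # Average only over masked patches, as in MAE-style objectives.
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Usage with random stand-in tensors:
B, L, D = 2, 196, 768
loss = structure_guided_mae_loss(torch.randn(B, L, D), torch.randn(B, L, D),
                                 torch.randn(B, L, D), torch.randn(B, L, D),
                                 (torch.rand(B, L) < 0.75).float())
```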
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatially quantized historical features (a toy quantizer is sketched below).
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
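As a toy version of the "spatially quantized historical features" mentioned in the entry above, the sketch below bins past-traversal LiDAR points into a bird's-eye-view histogram that a detector could consume as an extra input. Grid extents, cell size, and the function name are placeholders, not the paper's parameters.
```python
import torch

def quantize_history(points_xy, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                     cell=0.5):
    # points_xy: (N, 2) x/y coordinates aggregated from past traversals of
    # the same location.
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    ix = ((points_xy[:, 0] - x_range[0]) / cell).long().clamp(0, nx - 1)
    iy = ((points_xy[:, 1] - y_range[0]) / cell).long().clamp(0, ny - 1)
    hist = torch.zeros(nx * ny)
    hist.index_add_(0, ix * ny + iy, torch.ones(points_xy.size(0)))
    return hist.view(nx, ny)  # per-cell counts of historical points

# Usage: a 200x200 BEV histogram from 10k random points.
bev = quantize_history(torch.empty(10000, 2).uniform_(-50, 50))
```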
- Blind-Spot Collision Detection System for Commercial Vehicles Using Multi Deep CNN Architecture [0.17499351967216337]
Two convolutional neural networks (CNNs) based on high-level feature descriptors are proposed to detect blind-spot collisions for heavy vehicles.
A fusion approach is proposed to integrate two pre-trained networks for extracting high-level features for blind-spot vehicle detection.
The fusion of features significantly improves the performance of Faster R-CNN and outperforms existing state-of-the-art methods (see the fusion sketch below).
arXiv Detail & Related papers (2022-08-17T11:10:37Z)
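A toy sketch of the two-network feature fusion described in the blind-spot entry above: high-level features from two pre-trained CNNs are concatenated before a classification head. The backbone choices and the simple classifier are assumptions; the paper integrates the fused features into Faster R-CNN instead.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, vgg11

class DualCNNFusion(nn.Module):
    """Concatenates high-level features from two pre-trained CNNs."""
    def __init__(self, num_classes=2):
        super().__init__()
        r = resnet18(weights="DEFAULT")  # downloads pre-trained weights
        self.branch_a = nn.Sequential(*list(r.children())[:-1])  # -> (B, 512, 1, 1)
        v = vgg11(weights="DEFAULT")
        self.branch_b = nn.Sequential(v.features, nn.AdaptiveAvgPool2d(1))  # -> (B, 512, 1, 1)
        self.head = nn.Linear(512 + 512, num_classes)  # classifier on fused features

    def forward(self, x):
        fa = self.branch_a(x).flatten(1)
        fb = self.branch_b(x).flatten(1)
        return self.head(torch.cat([fa, fb], dim=1))

# Usage: logits for a batch of two 224x224 RGB crops.
logits = DualCNNFusion()(torch.randn(2, 3, 224, 224))
```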
- SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving [94.11868795445798]
We release a Large-Scale Object Detection benchmark for Autonomous Driving, named SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories.
To improve diversity, images are collected at a rate of one frame every ten seconds across 32 different cities, under different weather conditions, periods, and location scenes.
We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models.
arXiv Detail & Related papers (2021-06-21T13:55:57Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
- What My Motion tells me about Your Pose: A Self-Supervised Monocular 3D Vehicle Detector [41.12124329933595]
We demonstrate the use of monocular visual odometry for the self-supervised fine-tuning of a model for orientation estimation pre-trained on a reference domain.
We subsequently demonstrate an optimization-based monocular 3D bounding box detector built on top of the self-supervised vehicle orientation estimator.
arXiv Detail & Related papers (2020-07-29T12:58:40Z)
- Vehicle Detection of Multi-source Remote Sensing Data Using Active Fine-tuning Network [26.08837467340853]
The proposed Ms-AFt framework integrates transfer learning, segmentation, and active classification into a unified framework for auto-labeling and detection.
It first employs a fine-tuning network to generate a vehicle training set from an unlabeled dataset.
Extensive experimental results on two open ISPRS benchmark datasets demonstrate the superiority and effectiveness of the proposed Ms-AFt for vehicle detection.
arXiv Detail & Related papers (2020-07-16T17:46:46Z)
- VehicleNet: Learning Robust Visual Representation for Vehicle Re-identification [116.1587709521173]
We propose to build a large-scale vehicle dataset (called VehicleNet) by harnessing four public vehicle datasets.
We design a simple yet effective two-stage progressive approach to learning more robust visual representation from VehicleNet.
We achieve state-of-the-art accuracy of 86.07% mAP on the private test set of the AICity Challenge (see the two-stage sketch below).
arXiv Detail & Related papers (2020-04-14T05:06:38Z)
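A rough sketch of a two-stage progressive schedule like the one the VehicleNet entry describes: first train on the merged dataset, then fine-tune on the target set alone with a smaller backbone learning rate. Identity counts and learning rates are placeholders, not the paper's recipe.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Stage 1: train an identity classifier on the merged VehicleNet data.
# (num_ids_* are arbitrary placeholders, not the datasets' real counts.)
num_ids_merged, num_ids_target = 30000, 500
model = resnet50(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, num_ids_merged)
# ... train on the merged four-dataset pool ...

# Stage 2: swap the classifier and fine-tune on the target set only,
# with a smaller learning rate for the backbone than for the new head.
model.fc = nn.Linear(model.fc.in_features, num_ids_target)
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc")], "lr": 1e-3},
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)
# ... fine-tune on the target re-identification dataset ...
```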