Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models
- URL: http://arxiv.org/abs/2410.19635v1
- Date: Fri, 25 Oct 2024 15:38:24 GMT
- Title: Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models
- Authors: Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng
- Abstract summary: Recent vision foundation models can extract universal representations and show impressive abilities in various tasks.
We show that frozen foundation models can serve as versatile feature enhancers, even though they are not pre-trained for object detection.
- Score: 47.18069715855738
- License:
- Abstract: Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application to object detection is largely overlooked, especially without fine-tuning them. In this work, we show that frozen foundation models can serve as versatile feature enhancers, even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in the following two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector's decoder by providing a compact context. Second, the patch tokens in foundation models can enrich the features in the detector's encoder by providing semantic details. Utilizing frozen foundation models as plug-and-play modules rather than as the commonly used backbone significantly enhances the detector's performance while avoiding the problems caused by the architecture discrepancy between the detector's backbone and the foundation model. With this novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models, respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone.
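The abstract describes two plug-and-play injection points: the frozen model's class token as compact scene context for the decoder's object queries, and its patch tokens as semantic detail for the encoder features. The PyTorch-style sketch below only illustrates that idea under simplifying assumptions (a ViT-style foundation model returning a class token followed by patch tokens, a patch-token grid matching the encoder feature resolution, and simple additive fusion); the class and layer names are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn as nn


class FrozenFoundationEnhancer(nn.Module):
    """Illustrative sketch: a frozen ViT-style foundation model used as a
    plug-and-play feature enhancer for a query-based detector (not the
    paper's actual implementation)."""

    def __init__(self, foundation_vit: nn.Module, det_dim: int = 256, fm_dim: int = 768):
        super().__init__()
        self.foundation = foundation_vit.eval()  # kept frozen, no fine-tuning
        for p in self.foundation.parameters():
            p.requires_grad_(False)
        # Project foundation-model tokens into the detector's feature space.
        self.cls_proj = nn.Linear(fm_dim, det_dim)
        self.patch_proj = nn.Linear(fm_dim, det_dim)

    @torch.no_grad()
    def extract_tokens(self, images: torch.Tensor):
        # Assumption: the foundation model returns [B, 1 + N, fm_dim],
        # with the class token first, followed by N patch tokens.
        tokens = self.foundation(images)
        return tokens[:, :1], tokens[:, 1:]

    def forward(self, images, encoder_feats, object_queries):
        cls_tok, patch_toks = self.extract_tokens(images)
        # (1) Class token as compact scene context for the decoder:
        # prepend it to the object queries so attention can read it.
        context = self.cls_proj(cls_tok)                        # [B, 1, det_dim]
        queries_with_context = torch.cat([context, object_queries], dim=1)
        # (2) Patch tokens as semantic details for the encoder features:
        # additive fusion here for simplicity; in practice the token grid
        # would need resizing/alignment to the encoder's resolution.
        enriched_feats = encoder_feats + self.patch_proj(patch_toks)
        return enriched_feats, queries_with_context
```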
Related papers
- VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models [21.186456742407007]
We propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det.
Our model improves the baseline approach by +5.1% and +6.2% on the $AP_{0.5}$ and $AP_{0.75}$ metrics, respectively.
arXiv Detail & Related papers (2024-08-23T12:39:02Z) - DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM [81.75988648572347]
We present DetToolChain, a novel prompting paradigm to unleash the zero-shot object detection ability of multimodal large language models (MLLMs).
Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts.
We show that GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel class set for open-vocabulary detection.
arXiv Detail & Related papers (2024-03-19T06:54:33Z) - InstaGen: Enhancing Object Detection by Training on Synthetic Dataset [59.445498550159755]
We present a novel paradigm to enhance the ability of an object detector, e.g., by expanding categories or improving detection performance.
We integrate an instance-level grounding head into a pre-trained generative diffusion model to augment it with the ability to localise instances in the generated images.
We conduct thorough experiments to show that this enhanced diffusion model, termed InstaGen, can serve as a data synthesizer.
arXiv Detail & Related papers (2024-02-08T18:59:53Z) - Tucker Bilinear Attention Network for Multi-scale Remote Sensing Object Detection [10.060030309684953]
Large-scale variation of remote-sensing targets is one of the main challenges in VHR remote-sensing object detection.
This paper proposes two novel modules: Guided Attention and Tucker Bilinear Attention.
Based on two modules, we build a new multi-scale remote sensing object detection framework.
arXiv Detail & Related papers (2023-03-09T15:20:03Z) - DEYO: DETR with YOLO for Step-by-Step Object Detection [0.0]
This paper proposes a new two-stage object detection model, named DETR with YOLO (DEYO).
The first stage provides high-quality queries and anchors to the second stage, improving its performance and efficiency compared to the original DETR model.
Experiments demonstrate that DEYO attains 50.6 AP and 52.1 AP in 12 and 36 epochs, respectively.
arXiv Detail & Related papers (2022-11-12T06:36:17Z) - Decoupling Object Detection from Human-Object Interaction Recognition [37.133695677465376]
DEFR is a DEtection-FRee method to recognize Human-Object Interactions (HOI) at the image level without using object location or human pose.
We present two findings that boost the performance of the detection-free approach, which significantly outperforms the detection-assisted state of the art.
arXiv Detail & Related papers (2021-12-13T03:01:49Z) - Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z) - Robust and Accurate Object Detection via Adversarial Learning [111.36192453882195]
This work augments the fine-tuning stage for object detectors by exploring adversarial examples.
Our approach boosts the performance of state-of-the-art EfficientDets by +1.1 mAP on the object detection benchmark.
arXiv Detail & Related papers (2021-03-23T19:45:26Z) - Condensing Two-stage Detection with Automatic Object Key Part Discovery [87.1034745775229]
Two-stage object detectors generally require excessively large models for their detection heads to achieve high accuracy.
We propose that the model parameters of two-stage detection heads can be condensed and reduced by concentrating on object key parts.
Our proposed technique consistently maintains the original performance while removing around 50% of the model parameters of common two-stage detection heads.
arXiv Detail & Related papers (2020-06-10T01:20:47Z)