Toward Open Vocabulary Aerial Object Detection with CLIP-Activated
Student-Teacher Learning
- URL: http://arxiv.org/abs/2311.11646v2
- Date: Wed, 13 Mar 2024 13:42:38 GMT
- Authors: Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou,
Wenxian Yu
- Abstract summary: We propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework.
Our approach boosts not only novel object proposals but also classification.
Experimental results demonstrate that our CastDet achieves superior open-vocabulary detection performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An increasingly massive number of remote-sensing images spurs the development
of extensible object detectors that can detect objects beyond the training categories
without costly collection of new labeled data. In this paper, we aim to develop an
open-vocabulary object detection (OVD) technique for aerial images that scales the
object vocabulary beyond the training data. Two fundamental challenges hinder
open-vocabulary detection performance: the quality of the class-agnostic region
proposals and of the pseudo-labels that must generalize well to novel object
categories. To simultaneously generate high-quality proposals and pseudo-labels, we
propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection
framework. Our end-to-end framework, which follows the student-teacher self-learning
mechanism, employs the RemoteCLIP model as an extra omniscient teacher with rich
knowledge. In doing so, our approach boosts not only novel object proposals but also
their classification. Furthermore, we devise a dynamic label queue strategy to
maintain high-quality pseudo-labels during batch training. We conduct extensive
experiments on multiple existing aerial object detection datasets, which we set up
for the OVD task. Experimental results demonstrate that our CastDet achieves
superior open-vocabulary detection performance, e.g., reaching 40.5% mAP, which
outperforms the previous methods Detic/ViLD by 23.7%/14.9% on the VisDroneZSD
dataset. To the best of our knowledge, this is the first work to apply and develop
the open-vocabulary object detection technique for aerial images.
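The abstract's dynamic label queue can be illustrated with a minimal sketch: per-class FIFO queues that admit only teacher predictions above a confidence gate, so the freshest high-quality pseudo-labels are kept available across training batches. All names here (`PseudoLabelQueue`, the thresholds, the box format) are hypothetical illustrations, not the authors' actual implementation.

```python
from collections import deque

class PseudoLabelQueue:
    """Illustrative sketch of a dynamic pseudo-label queue: keeps only
    recent, high-confidence teacher predictions per class (FIFO with a
    score gate). Not the paper's actual API."""

    def __init__(self, max_per_class=100, score_threshold=0.8):
        self.max_per_class = max_per_class
        self.score_threshold = score_threshold
        self.queues = {}  # class name -> deque of (box, score)

    def push(self, class_name, box, score):
        """Admit a teacher prediction only if it clears the score gate."""
        if score < self.score_threshold:
            return False
        q = self.queues.setdefault(class_name,
                                   deque(maxlen=self.max_per_class))
        q.append((box, score))  # deque(maxlen=...) evicts the oldest entry
        return True

    def sample(self, class_name, k=1):
        """Return up to k of the most recent pseudo-labels for a class."""
        q = self.queues.get(class_name, deque())
        return list(q)[-k:]

# Usage: the teacher scores region proposals; accepted labels are queued
# and later mixed into the student's training batches.
queue = PseudoLabelQueue(max_per_class=2, score_threshold=0.5)
queue.push("airplane", (10, 10, 50, 50), 0.90)  # accepted
queue.push("airplane", (12, 8, 52, 48), 0.95)   # accepted
queue.push("airplane", (0, 0, 5, 5), 0.30)      # rejected: below gate
queue.push("airplane", (11, 9, 51, 49), 0.85)   # accepted, evicts oldest
labels = queue.sample("airplane", k=2)
```

The bounded deque is one simple way to realize the "maintain high-quality pseudo labels during batch training" behavior: stale labels age out automatically while the score gate filters noisy teacher outputs.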
Related papers
- Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection (2024-06-01)
  Open-vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories.
  Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the object detection task.
  We present a novel OVD framework termed LBP that learns background prompts to harness explored implicit background knowledge.
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection (2024-04-14)
  We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels.
  DetCLIPv3 is characterized by three core designs: 1) a versatile model architecture; 2) high-information-density data; and 3) an efficient training strategy.
  DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
- Generative Region-Language Pretraining for Open-Ended Object Detection (2024-03-15)
  We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
  Our framework achieves results comparable to the open-vocabulary object detection method GLIP.
- Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery (2024-03-08)
  We develop a few-shot object detector based on a traditional two-stage architecture.
  A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
  We perform evaluations on two remote sensing datasets containing challenging and rare objects.
- Open-Vocabulary Camouflaged Object Segmentation (2023-11-19)
  We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
  We construct a large-scale complex-scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
  By integrating the guidance of class semantic knowledge with visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
- Improved Region Proposal Network for Enhanced Few-Shot Object Detection (2023-08-15)
  Few-shot object detection (FSOD) methods have emerged as a solution to the limitations of classic object detection approaches.
  We develop a semi-supervised algorithm to detect and then utilize unlabeled novel objects as positive samples during the FSOD training stage.
  Our improved hierarchical sampling strategy for the region proposal network (RPN) also boosts the detector's perception of large objects.
- Open-Vocabulary Object Detection using Pseudo Caption Labels (2023-03-23)
  We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
  Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to state-of-the-art performance.
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection (2022-07-18)
  Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
  We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
  We demonstrate the value of the generated pseudo labels in two specific tasks: open-vocabulary detection and semi-supervised object detection.
- Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning (2022-05-09)
  We propose Incremental-DETR, which performs incremental few-shot object detection via fine-tuning and self-supervised learning on the DETR object detector.
  To alleviate severe over-fitting with few novel-class data, we first fine-tune the class-specific components of DETR with self-supervision.
  We further introduce an incremental few-shot fine-tuning strategy with knowledge distillation on the class-specific components of DETR to encourage the network to detect novel classes without catastrophic forgetting.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.