Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection
- URL: http://arxiv.org/abs/2211.00849v2
- Date: Sat, 29 Jul 2023 17:46:25 GMT
- Title: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection
- Authors: Yanxin Long, Jianhua Han, Runhui Huang, Hang Xu, Yi Zhu, Chunjing Xu,
Xiaodan Liang
- Abstract summary: We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
- Score: 87.39089806069707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the success of vision-language models (VLMs) in zero-shot
classification, recent works attempt to extend this line of work into object
detection by leveraging the localization ability of pre-trained VLMs and
generating pseudo labels for unseen classes in a self-training manner. However,
since current VLMs are usually pre-trained by aligning sentence embeddings with
global image embeddings, using them directly lacks the fine-grained alignment
over object instances that lies at the core of detection. In this paper,
we propose a simple but effective fine-grained Visual-Text Prompt-driven
self-training paradigm for Open-Vocabulary Detection (VTP-OVD) that introduces
a fine-grained visual-text prompt adapting stage to enhance the current
self-training paradigm with a more powerful fine-grained alignment. During the
adapting stage, we enable the VLM to obtain fine-grained alignment by using
learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Furthermore, we propose a visual prompt module that provides prior task
information (i.e., the categories to be predicted) to the vision branch,
better adapting the pre-trained VLM to downstream tasks. Experiments show
that our method achieves state-of-the-art performance for open-vocabulary
object detection, e.g., 31.5% mAP on unseen classes of COCO.
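For intuition, here is a minimal sketch of the mechanism the abstract describes: dense visual features are scored against class embeddings produced from learnable text prompts, and the resulting per-pixel logits can be supervised by an auxiliary dense prediction loss. All module names, shapes, and the tiny stand-in text encoder are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptDrivenPixelAlignment(nn.Module):
    """Illustrative sketch: score dense visual features against class
    embeddings built from learnable text prompts, yielding per-pixel
    logits for an auxiliary dense prediction task."""

    def __init__(self, num_classes: int, prompt_len: int = 8, embed_dim: int = 512):
        super().__init__()
        # Learnable continuous prompt vectors shared across classes
        # (the "learnable text prompts" of the adapting stage).
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        # Stand-ins for a pre-trained VLM's text side; a real system
        # would reuse the frozen VLM text encoder instead.
        self.class_token_embed = nn.Embedding(num_classes, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=1,
        )

    def forward(self, pixel_feats: torch.Tensor) -> torch.Tensor:
        """pixel_feats: (B, C, H, W) dense features from the vision branch.
        Returns per-pixel class logits of shape (B, num_classes, H, W)."""
        B, C, H, W = pixel_feats.shape
        K = self.class_token_embed.num_embeddings
        # Prepend the learnable prompt to every class token, then encode.
        cls_tok = self.class_token_embed.weight.unsqueeze(1)        # (K, 1, C)
        prompts = self.prompt.unsqueeze(0).expand(K, -1, -1)        # (K, L, C)
        text_emb = self.text_encoder(torch.cat([prompts, cls_tok], dim=1))[:, -1]
        # Cosine similarity between every pixel and every class embedding.
        pix = F.normalize(pixel_feats.flatten(2), dim=1)            # (B, C, HW)
        txt = F.normalize(text_emb, dim=-1)                         # (K, C)
        return torch.einsum("kc,bcn->bkn", txt, pix).view(B, K, H, W)
```

The auxiliary objective would be a standard per-pixel loss over these logits; optimizing it pushes the prompts, and thus the class embeddings, toward the instance-level alignment that global image-text pre-training lacks.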
Related papers
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased predictions in the open-vocabulary detection context.
Our observations lead to a simple yet effective paradigm, named MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- LAMM: Label Alignment for Multi-Modal Prompt Learning [17.478967970736115]
We introduce an innovative label alignment method named LAMM, which can adjust the category embeddings of downstream datasets through end-to-end training.
Our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios.
Our methodology exhibits preeminence in continual learning compared to other prompt tuning methods.
arXiv Detail & Related papers (2023-12-13T15:29:52Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z)
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource, iteratively updating the prompts, and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images.
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
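To make the classifier-from-text-encoder mechanism behind PromptDet concrete, here is a hedged sketch that scores box-proposal embeddings against class-name embeddings. It uses OpenAI's CLIP as a stand-in VLM; the prompt template, temperature, and function names are illustrative choices, not PromptDet's actual code.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP, used here only as a stand-in VLM

@torch.no_grad()
def build_open_vocab_classifier(class_names, device="cpu"):
    """Turn free-form class names into classifier weights via a VLM text
    encoder; the prompt template is an illustrative choice."""
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([f"a photo of a {name}" for name in class_names])
    text_emb = model.encode_text(tokens.to(device)).float()
    return F.normalize(text_emb, dim=-1)          # (num_classes, embed_dim)

def classify_proposals(region_feats, classifier, temperature=0.01):
    """region_feats: (num_boxes, embed_dim) embeddings of box proposals.
    Returns per-box class probabilities of shape (num_boxes, num_classes)."""
    region_feats = F.normalize(region_feats, dim=-1)
    return (region_feats @ classifier.t() / temperature).softmax(dim=-1)
```

High-confidence detections produced this way on uncurated web images are exactly the kind of pseudo labels that the self-training round described above would recycle.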
- Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [34.85604521903056]
We introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection.
We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector.
Experimental results show that our DetPro outperforms the baseline ViLD in all settings.
arXiv Detail & Related papers (2022-03-28T17:50:26Z)
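The continuous-prompt idea that DetPro builds on can be sketched as follows: the hand-written template words around each class name are replaced by free context vectors trained end to end, while the text encoder stays frozen. The encoder here is a generic stand-in, and all shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousPrompt(nn.Module):
    """Sketch of DetPro-style continuous prompts: learnable context vectors
    replace template words around each class-name embedding. The text
    encoder passed to forward() is a frozen stand-in, not ViLD/CLIP."""

    def __init__(self, class_name_emb: torch.Tensor, ctx_len: int = 16):
        super().__init__()
        dim = class_name_emb.shape[-1]
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)  # learned "words"
        self.register_buffer("cls_emb", class_name_emb)            # (K, dim), frozen

    def forward(self, text_encoder: nn.Module) -> torch.Tensor:
        K = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(K, -1, -1)              # (K, L, dim)
        seq = torch.cat([ctx, self.cls_emb.unsqueeze(1)], dim=1)   # (K, L+1, dim)
        return F.normalize(text_encoder(seq).mean(dim=1), dim=-1)  # (K, dim)

# Hypothetical training step: only the context vectors receive updates,
# pulled toward region embeddings of their ground-truth class.
def prompt_step(prompt, text_encoder, region_emb, labels, opt, tau=0.01):
    class_emb = prompt(text_encoder)                               # (K, dim)
    logits = F.normalize(region_emb, dim=-1) @ class_emb.t() / tau
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

An optimizer over just the context, e.g. torch.optim.SGD([prompt.ctx], lr=0.01), keeps the pre-trained vision-text alignment intact while the prompt adapts to detection.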
- Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts [14.808701042367401]
We argue that the use of object detection may not be suitable for vision-language pre-training.
This paper proposes a new method called X-VLM to perform multi-grained vision-language pre-training.
arXiv Detail & Related papers (2021-11-16T07:55:26Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
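As a sketch of the feature-adapter alternative to prompt tuning, the module below refines frozen VLM features with a small bottleneck MLP and blends the result back residually; the reduction factor and residual ratio follow common adapter practice but are illustrative assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """CLIP-Adapter-style bottleneck: a tiny trainable MLP on top of frozen
    VLM features, mixed back via a residual ratio (sizes are illustrative)."""

    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual blend: mostly the frozen feature, lightly adapted.
        return F.normalize(self.ratio * self.mlp(feats)
                           + (1 - self.ratio) * feats, dim=-1)
```

Only the adapter's small number of parameters is trained, which is what makes this cheaper than fine-tuning, or even prompt-tuning, either encoder branch.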