Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark
- URL: http://arxiv.org/abs/2503.14862v2
- Date: Thu, 20 Mar 2025 04:44:21 GMT
- Title: Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark
- Authors: Ying Liu, Yijing Hua, Haojiang Chai, Yanbo Wang, TengQi Ye,
- Abstract summary: 3F-OVD is a novel task that extends supervised fine-grained object detection to the open-vocabulary setting.<n>We benchmark state-of-the-art object detectors on our dataset for both settings.
- Score: 6.93847426808971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors can not make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.
Related papers
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and hierarchical labels.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z) - The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z) - Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning [13.667326007851674]
We propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework.
Our approach boosts not only novel object proposals but also classification.
Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-11-20T10:26:04Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS)
We construct a large-scale complex scene dataset (textbfOVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Open-Vocabulary Point-Cloud Object Detection without 3D Annotation [62.18197846270103]
The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions.
We develop a point-cloud detector that can learn a general representation for localizing various objects.
We also propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text.
arXiv Detail & Related papers (2023-04-03T08:22:02Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simple and effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD)
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z) - PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource, iteratively updating the prompts, and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images.
arXiv Detail & Related papers (2022-03-30T17:50:21Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.