LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained
Descriptors
- URL: http://arxiv.org/abs/2402.04630v1
- Date: Wed, 7 Feb 2024 07:26:49 GMT
- Title: LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained
Descriptors
- Authors: Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu
- Abstract summary: DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
Its conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open-vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
- Score: 58.75140338866403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the outstanding zero-shot capability of vision language models
(VLMs) in image classification tasks, open-vocabulary object detection has
attracted increasing interest by distilling the broad VLM knowledge into
detector training. However, most existing open-vocabulary detectors learn by
aligning region embeddings with categorical labels (e.g., bicycle) only,
disregarding the capability of VLMs in aligning visual embeddings with
fine-grained text descriptions of object parts (e.g., pedals and bells). This
paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that
introduces conditional context prompts and hierarchical textual descriptors
that enable precise region-text alignment as well as open-vocabulary detection
training in general. Specifically, the conditional context prompt transforms
regional embeddings into image-like representations that can be directly
integrated into general open vocabulary detection training. In addition, we
introduce large language models as an interactive and implicit knowledge
repository that enables iterative mining and refinement of visually oriented
textual descriptors for precise region-text alignment. Extensive experiments
over multiple large-scale benchmarks show that DVDet outperforms the
state-of-the-art consistently by large margins.
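A minimal sketch of the core idea, assuming a CLIP-like text encoder and hand-written descriptor lists (the `encode_text` stub, the `DESCRIPTORS` dictionary, and the fusion weight `alpha` are illustrative assumptions, not the authors' implementation): a region embedding is scored against both the category name and LLM-mined part descriptors, and the two similarities are fused.

```python
import torch
import torch.nn.functional as F

D = 512  # embedding width of a CLIP-like model

def encode_text(prompts):
    """Stand-in for a frozen VLM text encoder (e.g., CLIP); returns unit vectors."""
    g = torch.Generator().manual_seed(abs(hash(tuple(prompts))) % (2**31))
    return F.normalize(torch.randn(len(prompts), D, generator=g), dim=-1)

# Hypothetical part-level descriptors an LLM might return for each category.
DESCRIPTORS = {
    "bicycle": ["a pair of pedals", "a bell on the handlebar", "two spoked wheels"],
    "zebra":   ["black and white stripes", "a tufted tail", "a short upright mane"],
}

def classify_region(region_emb, alpha=0.5):
    """Fuse category-name similarity with mean part-descriptor similarity."""
    region_emb = F.normalize(region_emb, dim=-1)
    scores = {}
    for name, parts in DESCRIPTORS.items():
        s_name = (region_emb @ encode_text([f"a photo of a {name}"]).T).item()
        s_parts = (region_emb @ encode_text(parts).T).mean().item()
        scores[name] = alpha * s_name + (1 - alpha) * s_parts
    return max(scores, key=scores.get)

region = torch.randn(1, D)  # a region embedding from the detector head
print(classify_region(region))
```

In DVDet itself, the descriptors are mined and refined iteratively by querying an LLM, and a frozen VLM text encoder replaces the random stub above.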
Related papers
- Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images.
We also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings (a sketch of the hyperbolic distance follows below).
arXiv Detail & Related papers (2024-04-07T17:06:22Z)
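As a rough illustration of the hyperbolic learning the paper above proposes, the sketch below (an assumed formulation, not the authors' code) projects vision and caption embeddings into a Poincaré ball and uses the geodesic distance as an alignment loss; the paper's actual objective and hierarchy construction may differ.

```python
import torch

def project_to_ball(x, eps=1e-5):
    """Clip Euclidean vectors into the open unit ball (Poincaré model domain)."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    scale = torch.where(norm >= 1 - eps, (1 - eps) / norm, torch.ones_like(norm))
    return x * scale

def poincare_distance(u, v):
    """Geodesic distance in the Poincaré ball model."""
    sq = (u - v).pow(2).sum(-1)
    du = 1 - u.pow(2).sum(-1)
    dv = 1 - v.pow(2).sum(-1)
    return torch.acosh(1 + 2 * sq / (du * dv))

vision = project_to_ball(0.1 * torch.randn(4, 128))   # region embeddings
caption = project_to_ball(0.1 * torch.randn(4, 128))  # synthetic-caption embeddings
loss = poincare_distance(vision, caption).mean()      # pull matched pairs together
print(loss.item())
```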
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models (a mosaic sketch follows below).
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
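The mosaic idea named by CLIM above can be sketched as follows (illustrative only; the function name and the 2x2 layout are assumptions): several captioned images are tiled into one canvas, so each tile becomes a pseudo region whose box is known and whose text is the source image's caption.

```python
import torch

def make_mosaic(images):
    """images: (4, C, H, W) -> (C, 2H, 2W) canvas plus per-tile boxes."""
    assert images.shape[0] == 4, "this 2x2 sketch expects exactly four images"
    _, c, h, w = images.shape
    canvas = torch.zeros(c, 2 * h, 2 * w)
    boxes = []
    for i in range(4):
        row, col = divmod(i, 2)
        canvas[:, row * h:(row + 1) * h, col * w:(col + 1) * w] = images[i]
        boxes.append([col * w, row * h, (col + 1) * w, (row + 1) * h])  # x1, y1, x2, y2
    return canvas, torch.tensor(boxes, dtype=torch.float)

imgs = torch.rand(4, 3, 224, 224)   # four captioned images
mosaic, boxes = make_mosaic(imgs)   # each box pairs with its image's caption
print(mosaic.shape, boxes.shape)    # torch.Size([3, 448, 448]) torch.Size([4, 4])
```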
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We then evaluate several state-of-the-art open-vocabulary object detectors under the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- OV-VG: A Benchmark for Open-Vocabulary Visual Grounding [33.02137080950678]
This work introduces novel and challenging open-vocabulary visual grounding tasks.
The overarching aim is to establish connections between language descriptions and the localization of novel objects.
We have curated a benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images.
arXiv Detail & Related papers (2023-10-22T17:54:53Z)
- Aligning Bag of Regions for Open-Vocabulary Object Detection [74.89762864838042]
We propose to align the embedding of bag of regions beyond individual regions.
The proposed method groups contextually interrelated regions as a bag.
Our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of the open-vocabulary COCO and LVIS benchmarks (a bag-pooling sketch follows below).
arXiv Detail & Related papers (2023-02-27T17:39:21Z)
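A much-simplified sketch of the bag-of-regions idea above (mean pooling and a symmetric InfoNCE loss are stand-ins for the paper's projection of regions into the text encoder's word-embedding space):

```python
import torch
import torch.nn.functional as F

def bag_alignment_loss(region_embs, bag_ids, text_embs, tau=0.07):
    """region_embs: (R, D); bag_ids: (R,) mapping each region to one of B bags;
    text_embs: (B, D), one text embedding per bag."""
    B, D = text_embs.shape
    # Mean-pool each bag of regions into a single embedding.
    bags = torch.zeros(B, D).index_add_(0, bag_ids, region_embs)
    counts = torch.bincount(bag_ids, minlength=B).clamp_min(1).unsqueeze(1)
    bags = F.normalize(bags / counts, dim=-1)
    text = F.normalize(text_embs, dim=-1)
    logits = bags @ text.T / tau            # (B, B) bag-to-text similarities
    target = torch.arange(B)
    # Symmetric InfoNCE: each bag should match its own text and vice versa.
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target)) / 2

regions = torch.randn(6, 256)               # six region embeddings
bag_ids = torch.tensor([0, 0, 1, 1, 1, 2])  # grouped into three bags
texts = torch.randn(3, 256)                 # one text embedding per bag
print(bag_alignment_loss(regions, bag_ids, texts).item())
```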
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task (a prompt sketch follows below).
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
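The learnable text prompts above can be sketched CoOp-style (an assumed design: the linear layer stands in for a frozen text encoder, and the dense pixel-wise scoring illustrates the auxiliary prediction task rather than VTP-OVD's exact head):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompt(nn.Module):
    def __init__(self, n_classes, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)      # shared learnable context
        self.cls = nn.Parameter(torch.randn(n_classes, dim) * 0.02)  # class-name tokens
        self.encoder = nn.Linear(dim, dim)  # stand-in for a frozen text encoder

    def forward(self):
        # (C, n_ctx + 1, dim) token sequence -> pooled (C, dim) text embeddings
        seq = torch.cat([self.ctx.expand(self.cls.size(0), -1, -1),
                         self.cls.unsqueeze(1)], dim=1)
        return F.normalize(self.encoder(seq).mean(dim=1), dim=-1)

prompts = LearnablePrompt(n_classes=3)
pixels = F.normalize(torch.randn(2, 512, 16, 16), dim=1)  # dense image features
text = prompts()                                          # (3, 512)
logits = torch.einsum("bdhw,cd->bchw", pixels, text)      # pixel-wise class logits
print(logits.shape)  # torch.Size([2, 3, 16, 16])
```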
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources, iteratively update the prompts, and later self-train the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images (a pseudo-labelling sketch follows below).
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
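The two ingredients the summary above names, a classifier generated from text-encoder embeddings and self-training on pseudo labels from uncurated images, can be sketched as follows (hypothetical stand-ins throughout; the prompt template, temperature, and threshold are assumptions):

```python
import torch
import torch.nn.functional as F

def build_classifier(class_names, encode_text):
    """Classifier weights = normalized text embeddings, one row per class."""
    return F.normalize(encode_text([f"a photo of a {c}" for c in class_names]), dim=-1)

def pseudo_label(proposal_embs, W, tau=0.01, thresh=0.9):
    """Keep (proposal_index, class_index) pairs confident enough to self-train on."""
    probs = (F.normalize(proposal_embs, dim=-1) @ W.T / tau).softmax(dim=-1)
    conf, cls = probs.max(dim=-1)
    keep = conf > thresh
    return torch.nonzero(keep).squeeze(1), cls[keep]

# Illustrative stand-in for a frozen VLM text encoder.
encode_text = lambda prompts: torch.randn(len(prompts), 512)
W = build_classifier(["bicycle", "zebra", "kite"], encode_text)
proposals = torch.randn(100, 512)   # box-proposal embeddings from web images
idx, cls = pseudo_label(proposals, W)
print(f"{len(idx)} pseudo labels kept for the next self-training round")
```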
- Open-Vocabulary DETR with Conditional Matching [86.1530128487077]
OV-DETR is an open-vocabulary detector based on DETR.
It can detect any object given its class name or an exemplar image.
It achieves non-trivial improvements over the current state of the art (a query-conditioning sketch follows below).
arXiv Detail & Related papers (2022-03-22T16:54:52Z)
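A minimal sketch of the conditioning idea above (assumed shapes and projection, not OV-DETR's implementation or its matching loss): the same learned object queries are conditioned on either a class-name embedding or an exemplar-image embedding, so one decoder serves both query types.

```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    def __init__(self, n_queries=100, dim=256, dim_clip=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.proj = nn.Linear(dim_clip, dim)  # maps a CLIP-like embedding to query space

    def forward(self, cond):
        """cond: (dim_clip,) text or exemplar-image embedding."""
        return self.queries + self.proj(cond)  # broadcast over all queries

cq = ConditionalQueries()
text_emb = torch.randn(512)      # e.g., "zebra" through a text encoder
exemplar_emb = torch.randn(512)  # e.g., an image crop through an image encoder
q_text = cq(text_emb)            # queries now look for the named class
q_img = cq(exemplar_emb)         # or for objects matching the exemplar
print(q_text.shape)              # torch.Size([100, 256])
```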
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.