Multi-Modal Classifiers for Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2306.05493v1
- Date: Thu, 8 Jun 2023 18:31:56 GMT
- Title: Multi-Modal Classifiers for Open-Vocabulary Object Detection
- Authors: Prannay Kaul, Weidi Xie, Andrew Zisserman
- Abstract summary: The goal of this paper is open-vocabulary object detection (OVOD)
We adopt a standard two-stage object detector architecture.
We explore three ways via: language descriptions, image exemplars, or a combination of the two.
- Score: 104.77331131447541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is open-vocabulary object detection (OVOD)
$\unicode{x2013}$ building a model that can detect objects beyond the set of
categories seen at training, thus enabling the user to specify categories of
interest at inference without the need for model retraining. We adopt a
standard two-stage object detector architecture, and explore three ways for
specifying novel categories: via language descriptions, via image exemplars, or
via a combination of the two. We make three contributions: first, we prompt a
large language model (LLM) to generate informative language descriptions for
object classes, and construct powerful text-based classifiers; second, we
employ a visual aggregator on image exemplars that can ingest any number of
images as input, forming vision-based classifiers; and third, we provide a
simple method to fuse information from language descriptions and image
exemplars, yielding a multi-modal classifier. When evaluating on the
challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our
text-based classifiers outperform all previous OVOD works; (ii) our
vision-based classifiers perform as well as text-based classifiers in prior
work; (iii) using multi-modal classifiers perform better than either modality
alone; and finally, (iv) our text-based and multi-modal classifiers yield
better performance than a fully-supervised detector.
Related papers
- OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, eg, through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z) - On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization [35.39571632348391]
Few-shot learning aims to learn representations that can tackle novel tasks.
Recent studies show that cross-modal learning can improve representations for few-shot classification.
Language is a rich modality that can be used to guide visual learning.
arXiv Detail & Related papers (2024-05-29T04:29:12Z) - DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and hierarchical labels.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.