Open-Vocabulary Object Detection using Pseudo Caption Labels
- URL: http://arxiv.org/abs/2303.13040v1
- Date: Thu, 23 Mar 2023 05:10:22 GMT
- Authors: Han-Cheol Cho, Won Young Jhoo, Wooyoung Kang, Byungseok Roh
- Abstract summary: We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on vast amounts of image-text pairs. To improve the effectiveness of these methods, researchers have used large-vocabulary datasets covering many object classes, on the assumption that such data enables models to extract comprehensive knowledge about the relationships between objects and to generalize better to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including their attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which uses an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves 34.5 AP and 30.6 APr, comparable to state-of-the-art performance. PCL's simplicity and flexibility are further notable features: it is a straightforward pre-processing technique that can be used with any image captioning model, without restrictions on model architecture or the training process.
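To make the pipeline concrete, here is a minimal sketch of the PCL pre-processing step. The specific models (BLIP for captioning, CLIP for caption embeddings, both via Hugging Face transformers) and the helper pseudo_caption_labels are illustrative assumptions on our part; the abstract only states that any image captioning model can be plugged in.

```python
# Hedged sketch of Pseudo Caption Labeling (PCL) pre-processing:
# crop each annotated instance, caption the crop from several
# sampled "perspectives", and embed the captions so they can act
# as dense knowledge-distillation targets. Model choices are
# illustrative, not the paper's exact configuration.
from PIL import Image
import torch
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPTokenizer, CLIPModel)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def pseudo_caption_labels(image: Image.Image, boxes, n_captions=3):
    """Return captions and caption embeddings for each instance box.

    boxes: iterable of (x0, y0, x1, y1) instance boxes."""
    records = []
    for box in boxes:
        crop = image.crop(box)
        inputs = blip_proc(images=crop, return_tensors="pt")
        # Nucleus sampling yields diverse captions for one instance.
        out = blip.generate(**inputs, do_sample=True, top_p=0.9,
                            num_return_sequences=n_captions,
                            max_new_tokens=20)
        captions = [blip_proc.decode(ids, skip_special_tokens=True)
                    for ids in out]
        text = clip_tok(captions, padding=True, return_tensors="pt")
        with torch.no_grad():
            emb = clip.get_text_features(**text)  # distillation targets
        records.append({"box": box, "captions": captions,
                        "embeddings": emb})
    return records
```

Because this runs entirely offline over the training annotations, it leaves the detector's architecture and training loop untouched, which is what makes PCL model-agnostic.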
Related papers
- Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to automatically enrich open-vocabulary descriptions.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images.
We also propose a novel hyperbolic vision-language learning approach that imposes a hierarchy between visual and caption embeddings (see the sketch after this list).
arXiv Detail & Related papers (2024-04-07T17:06:22Z)
- Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses knowledge from a pre-trained Large Language Model to learn prompts that adapt a pre-trained Vision-Language Model such as CLIP to multi-label classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z)
- DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of building classifiers: from language descriptions, from image exemplars, or from a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce the novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework that learns directly from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a simple and effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
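As referenced in the hyperbolic-learning entry above, methods of that kind typically embed regions and captions in the Poincaré ball so that broad concepts sit near the origin and specific ones near the boundary. The following sketch shows only the standard Poincaré geodesic distance such approaches use as a matching score; it is a generic illustration, not the cited paper's exact formulation.

```python
# Generic Poincare-ball utilities for hyperbolic vision-language
# matching; an assumed simplification, not a specific paper's method.
import torch

def project_to_ball(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Rescale embeddings that fall outside the open unit ball."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    max_norm = 1.0 - eps
    scale = torch.where(norm > max_norm, max_norm / norm,
                        torch.ones_like(norm))
    return x * scale

def poincare_distance(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Geodesic distance in the Poincare ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    sq = ((u - v) ** 2).sum(-1)
    du = 1.0 - (u ** 2).sum(-1)
    dv = 1.0 - (v ** 2).sum(-1)
    return torch.acosh((1.0 + 2.0 * sq / (du * dv)).clamp_min(1.0 + 1e-7))

# Usage: score region embeddings against caption embeddings; the
# distance can replace cosine similarity in a contrastive loss.
regions = project_to_ball(0.1 * torch.randn(4, 64))
captions = project_to_ball(0.1 * torch.randn(4, 64))
print(poincare_distance(regions, captions))
```

Distances grow rapidly near the boundary of the ball, which is what lets hyperbolic space encode hierarchy: a generic caption placed near the origin stays close to many specific region embeddings at once.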