Language-conditioned Detection Transformer
- URL: http://arxiv.org/abs/2311.17902v1
- Date: Wed, 29 Nov 2023 18:53:47 GMT
- Title: Language-conditioned Detection Transformer
- Authors: Jang Hyun Cho, Philipp Krähenbühl
- Abstract summary: Our framework uses both image-level labels and detailed detection annotations when available.
We first train a language-conditioned object detector on fully-supervised detection data.
We use this detector to pseudo-label images with image-level labels.
Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images.
- Score: 4.8951183832371
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new open-vocabulary detection framework. Our framework uses both
image-level labels and detailed detection annotations when available. Our
framework proceeds in three steps. We first train a language-conditioned object
detector on fully-supervised detection data. This detector sees the presence or
absence of ground-truth classes during training and conditions its predictions
on the set of present classes. We use this detector to pseudo-label images that
carry only image-level labels. Thanks to this conditioning mechanism, our
detector provides much more accurate pseudo-labels than prior approaches. Finally,
we train an unconditioned open-vocabulary detector on the pseudo-annotated
images. The resulting detector, named DECOLA, shows strong zero-shot
performance on the open-vocabulary LVIS benchmark as well as on direct
zero-shot transfer benchmarks on LVIS, COCO, Objects365, and OpenImages. DECOLA
outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS
benchmark. DECOLA achieves state-of-the-art results across various model sizes,
architectures, and datasets while training only on open-source data with
academic-scale compute. Code is available at
https://github.com/janghyuncho/DECOLA.
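To make the three-step recipe concrete, here is a minimal Python sketch of the pseudo-labeling step. The names used here (LanguageConditionedDetector, pseudo_label, the 0.5 score threshold) are illustrative assumptions for this summary, not the released DECOLA API; see the repository linked above for the actual implementation.

```python
# A minimal sketch of the three-step recipe described in the abstract above.
# All names (LanguageConditionedDetector, pseudo_label, the 0.5 threshold)
# are illustrative assumptions, not the released DECOLA code.

from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Box:
    xyxy: tuple   # (x1, y1, x2, y2) in image coordinates
    label: str    # class name the prediction was conditioned on
    score: float  # detector confidence


class LanguageConditionedDetector:
    """Step 1: a detector trained on fully-supervised data that is told which
    classes are present in each image and predicts boxes only for those."""

    def detect(self, image, present_classes: Sequence[str]) -> List[Box]:
        # Stand-in behavior so the sketch runs; a real model would condition
        # DETR-style object queries on embeddings of `present_classes`.
        h, w = image["height"], image["width"]
        return [Box((0, 0, w, h), name, 0.9) for name in present_classes]


def pseudo_label(
    detector: LanguageConditionedDetector,
    images: Dict[str, dict],                 # image id -> image record
    image_labels: Dict[str, Sequence[str]],  # image id -> image-level labels
    score_threshold: float = 0.5,
) -> Dict[str, List[Box]]:
    """Step 2: turn image-level labels into box-level pseudo-annotations.

    Conditioning on the classes known to be present restricts the detector to
    those classes, which is what makes these pseudo-labels more accurate than
    unconditioned predictions."""
    annotations = {}
    for image_id, class_names in image_labels.items():
        boxes = detector.detect(images[image_id], present_classes=class_names)
        annotations[image_id] = [b for b in boxes if b.score >= score_threshold]
    return annotations


# Step 3 (not shown): train a standard, unconditioned open-vocabulary detector
# on the union of the original detection data and these pseudo-annotations.
if __name__ == "__main__":
    imgs = {"img0": {"height": 480, "width": 640}}
    labels = {"img0": ["zebra", "umbrella"]}
    print(pseudo_label(LanguageConditionedDetector(), imgs, labels))
```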
Related papers
- Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification [19.850063789903846]
Vision-Language Models for remote sensing have shown promise thanks to their extensive pretraining.
Our approach improves zero-shot scene classification by combining initial predictions from text prompting with patch affinity relationships.
Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements.
arXiv Detail & Related papers (2024-09-01T11:39:13Z) - OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector through simple alignment.
Experimental results demonstrate that the proposed approach outperforms existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z) - DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z) - Region-centric Image-Language Pretraining for Open-Vocabulary Detection [39.17829005627821]
We present a new open-vocabulary detection approach based on region-centric image-language pretraining.
During the pretraining phase, we incorporate the detector architecture on top of the classification backbone.
Our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues.
arXiv Detail & Related papers (2023-09-29T21:56:37Z) - Augmenting Zero-Shot Detection Training with Image Labels [0.0]
Zero-shot detection (ZSD) is essential for real-world detection use cases, but remains a difficult task.
Recent research attempts ZSD with detection models that output embeddings instead of direct class labels.
We address this challenge by leveraging the CLIP embedding space in combination with image labels from ImageNet.
arXiv Detail & Related papers (2023-06-12T07:06:01Z) - Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
The key problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. First, a simple scheme is used to augment the text embeddings, which prevents overfitting to the small number of classes seen during training.
Second, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of image-text pairs.
arXiv Detail & Related papers (2023-03-23T17:59:53Z) - Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are pretrained CLIP models and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z) - Detecting Twenty-thousand Classes using Image-level Supervision [40.948910656287865]
We propose Detic, which expands the vocabulary of detectors to tens of thousands of concepts.
Unlike prior work, Detic does not assign image labels to boxes based on model predictions.
For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset.
arXiv Detail & Related papers (2022-01-07T18:57:19Z) - Zero-Shot Detection via Vision and Language Knowledge Distillation [28.54818724798105]
We propose ViLD, a training method via Vision and Language knowledge Distillation.
We distill the knowledge from a pre-trained zero-shot image classification model into a two-stage detector.
Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model; a minimal sketch of this alignment pattern is given at the end of this list.
arXiv Detail & Related papers (2021-04-28T17:58:57Z) - Self-Supervised Person Detection in 2D Range Data using a Calibrated Camera [83.31666463259849]
We propose a method to automatically generate training labels (called pseudo-labels) for 2D LiDAR-based person detectors.
We show that self-supervised detectors, trained or fine-tuned with pseudo-labels, outperform detectors trained using manual annotations.
Our method is an effective way to improve person detectors during deployment without any additional labeling effort.
arXiv Detail & Related papers (2020-12-16T12:10:04Z) - Dense Label Encoding for Boundary Discontinuity Free Rotation Detection [69.75559390700887]
This paper explores a relatively less-studied methodology for rotation detection based on classification rather than angle regression.
We propose new techniques to push its frontier in two aspects.
Experiments and visual analysis on large-scale public datasets for aerial images show the effectiveness of our approach.
arXiv Detail & Related papers (2020-11-19T05:42:02Z)
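As referenced in the ViLD entry above, several of the listed detectors share one basic mechanism: detector region embeddings are scored against frozen text embeddings of class names, optionally with a distillation loss toward CLIP image embeddings of cropped proposals. The sketch below illustrates only that pattern; the tensor shapes, temperature, and choice of L1 loss are assumptions for illustration and do not reproduce any single paper's exact configuration.

```python
# Minimal sketch of region-text / region-image alignment for open-vocabulary
# detection. Shapes, temperature, and loss choice are illustrative assumptions.

import torch
import torch.nn.functional as F


def classification_logits(region_embeds, text_embeds, temperature=0.01):
    """Score detector region embeddings against frozen text embeddings.

    region_embeds: (num_regions, dim) from the detector head.
    text_embeds:   (num_classes, dim) text embeddings of class names; new
                   classes can be added at test time by appending rows, which
                   is what gives these detectors open-vocabulary behavior.
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return region_embeds @ text_embeds.t() / temperature


def distillation_loss(region_embeds, clip_image_embeds):
    """Pull region embeddings toward image embeddings of the cropped proposals
    computed by a pretrained vision-language model such as CLIP."""
    return F.l1_loss(
        F.normalize(region_embeds, dim=-1),
        F.normalize(clip_image_embeds, dim=-1),
    )


if __name__ == "__main__":
    regions = torch.randn(100, 512)       # e.g. 100 proposals, embedding dim 512
    class_texts = torch.randn(1203, 512)  # e.g. an LVIS-sized vocabulary
    clip_crops = torch.randn(100, 512)    # image embeddings of the crops

    logits = classification_logits(regions, class_texts)
    loss = distillation_loss(regions, clip_crops)
    print(logits.shape, float(loss))
```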