Three ways to improve feature alignment for open vocabulary detection
- URL: http://arxiv.org/abs/2303.13518v1
- Date: Thu, 23 Mar 2023 17:59:53 GMT
- Title: Three ways to improve feature alignment for open vocabulary detection
- Authors: Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J.
  Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman
- Abstract summary: The key problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, improving detection performance on classes with no human-annotated bounding boxes.
- Score: 88.65076922242184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The core problem in zero-shot open vocabulary detection is how to align
visual and text features, so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch,
which breaks the vision-text feature alignment established during pretraining,
and struggles to prevent the language model from forgetting unseen classes.
We propose three methods to alleviate these issues. Firstly, a simple scheme
is used to augment the text embeddings which prevents overfitting to a small
number of classes seen during training, while simultaneously saving memory and
computation. Secondly, the feature pyramid network and the detection head are
modified to include trainable gated shortcuts, which encourages vision-text
feature alignment and guarantees it at the start of detection training.
Finally, a self-training approach is used to leverage a larger corpus of
image-text pairs thus improving detection performance on classes with no human
annotated bounding boxes.
Our three methods are evaluated on the zero-shot version of the LVIS
benchmark, each of them showing clear and significant benefits. Our final
network achieves a new state-of-the-art on the mAP-all metric and demonstrates
competitive performance for mAP-rare, as well as superior transfer to COCO and
Objects365.
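The abstract does not spell out the text-embedding augmentation scheme. As a minimal sketch under a loud assumption, the snippet below jitters precomputed class text embeddings with Gaussian noise during training; caching the embeddings instead of running the text encoder online is one plausible source of the claimed memory and compute savings. All names (`augment_text_embeddings`, `classify_regions`, `noise_std`) are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def augment_text_embeddings(class_embeds: torch.Tensor,
                            noise_std: float = 0.1) -> torch.Tensor:
    # class_embeds: (num_classes, dim) L2-normalized text embeddings,
    # precomputed once so the text encoder never runs during detection
    # training. The Gaussian jitter is an assumed placeholder; the
    # paper's exact augmentation is not described in the abstract.
    noisy = class_embeds + noise_std * torch.randn_like(class_embeds)
    return F.normalize(noisy, dim=-1)

def classify_regions(region_feats: torch.Tensor,
                     class_embeds: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    # Cosine-similarity logits between detected region features and the
    # (augmented) class embeddings, CLIP-style.
    region_feats = F.normalize(region_feats, dim=-1)
    return region_feats @ class_embeds.t() / temperature
```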
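For the gated shortcuts, the abstract guarantees vision-text alignment at the start of detection training. A zero-initialized residual gate has exactly that property: each newly trained FPN or head block is wrapped so that it contributes nothing at initialization and pretrained features pass through unchanged. The tanh gating below is an assumption in the spirit of zero-init residual schemes, not necessarily the paper's formulation.

```python
import torch
from torch import nn

class GatedShortcut(nn.Module):
    # Wraps a trainable block f as x + tanh(g) * f(x). With g = 0 at
    # initialization, the module is exactly the identity, so pretrained
    # vision features (and their alignment with text) are preserved at
    # the start of training. Assumed formulation, not the paper's exact one.
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.gate = nn.Parameter(torch.zeros(()))  # scalar gate, starts closed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.block(x)

# Example: gate a hypothetical new FPN conv so it starts as an identity.
lateral = GatedShortcut(nn.Conv2d(256, 256, kernel_size=3, padding=1))
feat = torch.randn(1, 256, 32, 32)
assert torch.allclose(lateral(feat), feat)  # identity at initialization
```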
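Finally, the self-training step can be read as standard pseudo-labeling over a large corpus of image-text pairs: run the current detector on unlabeled images, keep confident boxes, and train on them as if they were ground truth. The sketch assumes a torchvision-style detector interface; the threshold and output format are illustrative, not the paper's recipe.

```python
import torch

@torch.no_grad()
def pseudo_label(detector, images, score_thresh: float = 0.5):
    # Run the current detector on unlabeled images and keep only
    # confident detections as pseudo ground-truth boxes. Assumes each
    # output is a dict with "boxes", "scores" and "labels"
    # (torchvision-style); names and threshold are illustrative.
    detector.eval()
    targets = []
    for out in detector(images):
        keep = out["scores"] > score_thresh
        targets.append({"boxes": out["boxes"][keep],
                        "labels": out["labels"][keep]})
    return targets
```

A self-training round would then mix these pseudo-labeled pairs with the human-annotated detection data before continuing training.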
Related papers
- Region-centric Image-Language Pretraining for Open-Vocabulary Detection [39.17829005627821]
We present a new open-vocabulary detection approach based on region-centric image-language pretraining.
At the pretraining phase, we incorporate the detector architecture on top of the classification backbone.
Our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues.
arXiv Detail & Related papers (2023-09-29T21:56:37Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.
This paper proposes a new method, termed TCM, which turns the CLIP model directly into a text detector without a dedicated pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [34.85604521903056]
We introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection.
We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector.
Experimental results show that our DetPro outperforms the baseline ViLD in all settings.
arXiv Detail & Related papers (2022-03-28T17:50:26Z)
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [21.479222207347238]
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS is trained in both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- Towards Open Vocabulary Object Detection without Human-provided Bounding Boxes [74.24276505126932]
We propose an open vocabulary detection framework that can be trained without manually provided bounding-box annotations.
Our method achieves this by leveraging the localization ability of pre-trained vision-language models.
arXiv Detail & Related papers (2021-11-18T00:05:52Z)
- Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing a 50%–90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z)
- ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition [22.367624178280682]
We design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition.
ReADS can be trained end-to-end and requires only word-level annotations.
arXiv Detail & Related papers (2020-04-05T02:05:35Z)