Zero-Shot Detection via Vision and Language Knowledge Distillation
- URL: http://arxiv.org/abs/2104.13921v1
- Date: Wed, 28 Apr 2021 17:58:57 GMT
- Title: Zero-Shot Detection via Vision and Language Knowledge Distillation
- Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
- Abstract summary: We propose ViLD, a training method via Vision and Language knowledge Distillation.
We distill the knowledge from a pre-trained zero-shot image classification model into a two-stage detector.
Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model.
- Score: 28.54818724798105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot image classification has made promising progress by training
aligned image and text encoders. The goal of this work is to advance zero-shot
object detection, which aims to detect novel objects without bounding box or
mask annotations. We propose ViLD, a training method via Vision and Language
knowledge Distillation. We distill the knowledge from a pre-trained zero-shot
image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask
R-CNN). Our method aligns the region embeddings in the detector to the text and
image embeddings inferred by the pre-trained model. We use the text embeddings
as the detection classifier, obtained by feeding category names into the
pre-trained text encoder. We then minimize the distance between the region
embeddings and image embeddings, obtained by feeding region proposals into the
pre-trained image encoder. During inference, we include text embeddings of
novel categories into the detection classifier for zero-shot detection. We
benchmark the performance on LVIS dataset by holding out all rare categories as
novel categories. ViLD obtains 16.1 mask AP$_r$ with a Mask R-CNN (ResNet-50
FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8.
The model can directly transfer to other datasets, achieving 72.2 AP$_{50}$,
36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively.
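For readers who want the two objectives in code form, the snippet below is a minimal PyTorch-style sketch of the training losses described in the abstract and of the inference-time classifier expansion. The tensor shapes, the temperature value, and the use of an L1 distance for distillation are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the ViLD objectives as described in the abstract;
# hyperparameters and the L1 distance are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def vild_text_loss(region_embs, text_embs, labels, temperature=0.01):
    # Text embeddings of base-category names act as the detection classifier.
    r = F.normalize(region_embs, dim=-1)      # (R, D) region embeddings from the detector
    t = F.normalize(text_embs, dim=-1)        # (C, D) one row per base category
    logits = r @ t.t() / temperature          # temperature is an assumed hyperparameter
    return F.cross_entropy(logits, labels)    # labels: base-category index per region

def vild_image_loss(region_embs, clip_image_embs):
    # Distillation: pull region embeddings toward the CLIP image embeddings of
    # the cropped region proposals (L1 distance used here as a stand-in).
    return F.l1_loss(F.normalize(region_embs, dim=-1),
                     F.normalize(clip_image_embs, dim=-1))

def zero_shot_logits(region_embs, base_text_embs, novel_text_embs, temperature=0.01):
    # Inference: append text embeddings of novel categories to the classifier.
    all_text = F.normalize(torch.cat([base_text_embs, novel_text_embs], dim=0), dim=-1)
    return F.normalize(region_embs, dim=-1) @ all_text.t() / temperature
```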
Related papers
- Joint Neural Networks for One-shot Object Recognition and Detection [5.389851588398047]
This paper presents a novel joint neural networks approach to address the challenging one-shot object recognition and detection tasks.
Inspired by Siamese neural networks and state-of-the-art multi-box detection approaches, the joint neural networks are able to perform object recognition and detection for categories that remain unseen during the training process.
The proposed approach achieves 61.41% accuracy for one-shot object recognition on the MiniImageNet dataset and 47.1% mAP for one-shot object detection when trained and tested on the same dataset.
arXiv Detail & Related papers (2024-08-01T16:48:03Z)
- Language-conditioned Detection Transformer [4.8951183832371]
Our framework uses both image-level labels and detailed detection annotations when available.
We first train a language-conditioned object detector on fully-supervised detection data.
We use this detector to pseudo-label images with image-level labels.
Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images.
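As a rough illustration of this three-stage recipe, the sketch below simply wires the stages together; the stage functions are passed in as callables because the paper's actual training and pseudo-labeling routines are not specified here.

```python
# Hypothetical glue code for the three-stage pipeline; the callables
# (train_conditioned, pseudo_label, train_unconditioned) are placeholders,
# not an API from the paper.
from typing import Any, Callable, Iterable, Tuple

def build_open_vocab_detector(
    train_conditioned: Callable[[Any], Any],
    pseudo_label: Callable[[Any, Any, Any], Any],
    train_unconditioned: Callable[[Any, list], Any],
    detection_data: Any,
    weakly_labeled: Iterable[Tuple[Any, Any]],
) -> Any:
    # 1) Train a language-conditioned detector on fully supervised detection data.
    cond_detector = train_conditioned(detection_data)
    # 2) Pseudo-label images that only carry image-level labels, conditioning
    #    the detector on those labels.
    pseudo = [pseudo_label(cond_detector, image, labels)
              for image, labels in weakly_labeled]
    # 3) Train the final unconditioned open-vocabulary detector on the
    #    pseudo-annotated images (plus the original detection data).
    return train_unconditioned(detection_data, pseudo)
```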
arXiv Detail & Related papers (2023-11-29T18:53:47Z)
- Image-free Classifier Injection for Zero-Shot Classification [72.66409483088995]
Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training.
We aim to equip pre-trained models with zero-shot classification capabilities without the use of image data.
We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS).
arXiv Detail & Related papers (2023-08-21T09:56:48Z)
- Read, look and detect: Bounding box annotation from image-caption pairs [2.0305676256390934]
We propose a method to locate and label objects in an image by using a form of weaker supervision: image-caption pairs.
Our experiments demonstrate the effectiveness of our approach by achieving a 47.51% recall@1 score in phrase grounding on Flickr30k Entities.
arXiv Detail & Related papers (2023-06-09T12:23:20Z)
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
The key problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to the small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of image-text pairs.
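Of the three changes, the trainable-shortcut idea is the most concrete to sketch. The module below shows one possible reading, where a frozen pretrained layer is wrapped with a zero-initialized trainable branch; this interpretation and the module are assumptions, not the paper's code.

```python
# One possible reading of "trainable shortcuts" (an assumption): keep the
# pretrained layer frozen and add a zero-initialized trainable branch, so
# training starts from the pretrained behaviour and can only refine it.
import torch.nn as nn

class TrainableShortcut(nn.Module):
    def __init__(self, frozen_layer: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_layer
        for p in self.frozen.parameters():
            p.requires_grad_(False)            # preserve the pretrained vision-text alignment
        self.branch = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.branch.weight)     # behaves as the frozen layer at initialization
        nn.init.zeros_(self.branch.bias)

    def forward(self, x):
        return self.frozen(x) + self.branch(x)
```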
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- CapDet: Unifying Dense Captioning and Open-World Detection Pretraining [68.8382821890089]
We propose a novel open-world detector named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes.
Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head.
arXiv Detail & Related papers (2023-03-04T19:53:00Z)
- Cut and Learn for Unsupervised Object Detection and Instance Segmentation [65.43627672225624]
Cut-and-LEaRn (CutLER) is a simple approach for training unsupervised object detection and segmentation models.
CutLER is a zero-shot unsupervised detector and improves detection performance (AP50) by over 2.7 times on 11 benchmarks.
arXiv Detail & Related papers (2023-01-26T18:57:13Z)
- ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation [5.424015823818208]
A dataset such as COCO is extensively annotated across many images but covers only a sparse set of categories, and annotating all object classes across a diverse domain is expensive and challenging.
We develop a Vision-Language distillation method that aligns both image and text embeddings from a zero-shot pre-trained model such as CLIP to a modified semantic prediction head from a one-stage detector like YOLOv5.
During inference, our model can be adapted to detect any number of object classes without additional training.
arXiv Detail & Related papers (2021-09-24T16:46:36Z)
- Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation [23.631184498984933]
Natural language has been shown to be a broader and richer source of supervision than supervised "gold" labels.
We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x smaller than CLIP.
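The idea of learning from soft labels can be sketched as follows: a frozen teacher pair of encoders produces a soft similarity distribution over the batch, which the student matches with a KL term instead of one-hot contrastive targets. The loss form and temperatures below are assumptions for illustration, not the paper's exact objective.

```python
# Sketch of contrastive distillation with soft teacher labels (assumed form).
import torch
import torch.nn.functional as F

def soft_contrastive_distillation(student_img, student_txt,
                                  teacher_img, teacher_txt,
                                  tau_s: float = 0.07, tau_t: float = 0.07):
    """All inputs are (B, D) embeddings for the same batch of image-text pairs."""
    s_logits = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).t() / tau_s
    with torch.no_grad():
        t_logits = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).t() / tau_t
        soft_targets = t_logits.softmax(dim=-1)        # soft labels from the frozen teacher
    # Match the student's image-to-text distribution to the teacher's.
    return F.kl_div(s_logits.log_softmax(dim=-1), soft_targets, reduction="batchmean")
```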
arXiv Detail & Related papers (2021-04-18T19:55:31Z)
- DetCo: Unsupervised Contrastive Learning for Object Detection [64.22416613061888]
Unsupervised contrastive learning achieves great success in learning image representations with CNNs.
We present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between global image and local image patches.
DetCo consistently outperforms the supervised method by 1.6/1.2/1.0 AP on Mask RCNN-C4/FPN/RetinaNet with a 1x schedule.
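To make the global-versus-local contrast concrete, here is a minimal InfoNCE-style sketch in the same spirit; the patch aggregation, single-level loss, and temperature are simplifications, not DetCo's full multi-level objective.

```python
# Sketch of a global-image vs. local-patch contrast (simplified, not DetCo's code).
import torch
import torch.nn.functional as F

def global_local_infonce(global_embs: torch.Tensor, patch_embs: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """global_embs: (B, D) one embedding per image.
    patch_embs: (B, P, D) embeddings of P local patches per image."""
    g = F.normalize(global_embs, dim=-1)
    l = F.normalize(patch_embs.mean(dim=1), dim=-1)      # aggregate patches per image
    logits = g @ l.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```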
arXiv Detail & Related papers (2021-02-09T12:47:20Z)
- Exploring Bottom-up and Top-down Cues with Attentive Learning for Webly Supervised Object Detection [76.9756607002489]
We propose a novel webly supervised object detection (WebSOD) method for novel classes.
Our proposed method combines bottom-up and top-down cues for novel class detection.
We demonstrate our proposed method on PASCAL VOC dataset with three different novel/base splits.
arXiv Detail & Related papers (2020-03-22T03:11:24Z)