Simple Open-Vocabulary Object Detection with Vision Transformers
- URL: http://arxiv.org/abs/2205.06230v1
- Date: Thu, 12 May 2022 17:20:36 GMT
- Title: Simple Open-Vocabulary Object Detection with Vision Transformers
- Authors: Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk
Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa
Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
- Abstract summary: We propose a strong recipe for transferring image-text models to open-vocabulary object detection.
We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection.
- Score: 51.57562920090721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Combining simple architectures with large-scale pre-training has led to
massive improvements in image classification. For object detection,
pre-training and scaling approaches are less well established, especially in
the long-tailed and open-vocabulary setting, where training data is relatively
scarce. In this paper, we propose a strong recipe for transferring image-text
models to open-vocabulary object detection. We use a standard Vision
Transformer architecture with minimal modifications, contrastive image-text
pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling
properties of this setup shows that increasing image-level pre-training and
model size yield consistent improvements on the downstream detection task. We
provide the adaptation strategies and regularizations needed to attain very
strong performance on zero-shot text-conditioned and one-shot image-conditioned
object detection. Code and models are available on GitHub.
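As a concrete illustration of the zero-shot text-conditioned usage the abstract describes, here is a minimal inference sketch. It assumes the Hugging Face Transformers port of OWL-ViT (the official release is JAX code from Google Research) and a hypothetical input image; the checkpoint name and query strings are illustrative.

```python
# Zero-shot text-conditioned detection with the Hugging Face port of OWL-ViT.
# Assumes transformers, torch, and Pillow are installed; the checkpoint is the
# public "google/owlvit-base-patch32" release.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")        # hypothetical input image
texts = [["a photo of a dog", "a photo of a bicycle"]]  # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```

The same processor and model also expose the one-shot image-conditioned mode the abstract mentions (query images in place of text queries), but the text-conditioned path above is the simplest entry point.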
Related papers
- Adapting Vision Transformer for Efficient Change Detection [36.86012953467539]
We propose an efficient tuning approach that freezes the parameters of the pretrained image encoder and introduces a small number of additional trainable parameters.
We achieve competitive or even better results while maintaining extremely low resource consumption across six change detection benchmarks; a generic sketch of this freeze-and-adapt pattern follows this entry.
arXiv Detail & Related papers (2023-12-08T07:09:03Z)
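The freeze-and-adapt recipe above is a common pattern. Below is a minimal PyTorch sketch of it, assuming a torchvision ViT backbone and a hypothetical two-image change-detection head; the paper's actual adapter design is not reproduced here.

```python
# Generic freeze-and-adapt pattern (a sketch, not the paper's exact method):
# a pretrained encoder is frozen and only a small, newly added head is trained.
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.vit_b_16(weights="IMAGENET1K_V1")  # stand-in pretrained ViT
encoder.heads = nn.Identity()            # expose the 768-d CLS feature
for p in encoder.parameters():
    p.requires_grad = False              # freeze all pretrained weights

# Newly introduced trainable parameters: a tiny head that compares features
# of two co-registered images (hypothetical design).
head = nn.Sequential(nn.Linear(768 * 2, 256), nn.ReLU(), nn.Linear(256, 2))

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)  # head params only

def forward(img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():                # frozen encoder runs inference-only
        feat = torch.cat([encoder(img_a), encoder(img_b)], dim=-1)
    return head(feat)                    # logits: change vs. no change
```

Because gradients flow only through the head, resource consumption stays low, which is the property the paper emphasizes.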
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in modeling the distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model this latent correspondence.
A bilevel learning framework is constructed to endow the encoder with scene-irrelevant generality across diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust, intent-aware cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance; a simplified sketch of text-conditioned crop scoring follows this entry.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
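To make the idea of text-conditioned cropping concrete, here is a simplified stand-in: rank sliding-window crop candidates by their CLIP image-text similarity to the query. This is not ClipCrop's actual pipeline (the paper builds a dedicated model), only a sketch of the underlying vision-language scoring idea.

```python
# Sketch: select the crop of an image most similar to a text query by
# scoring sliding-window candidates with CLIP. Window/stride values are
# illustrative choices, not from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_crop(image: Image.Image, query: str, window=224, stride=112):
    """Return the (left, top, right, bottom) box most similar to `query`."""
    w, h = image.size
    boxes = [(x, y, x + window, y + window)
             for x in range(0, max(w - window, 1), stride)
             for y in range(0, max(h - window, 1), stride)]
    crops = [image.crop(b) for b in boxes]
    inputs = processor(text=[query], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text[0]  # one score per crop
    return boxes[int(sims.argmax())]
```

A real system would add aesthetic scoring and finer box regression on top of this coarse retrieval step.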
- SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding [5.715548995729382]
We propose an effective technique for image augmentation by injecting contextually meaningful knowledge into the scenes.
Our method of semantically meaningful image augmentation for object detection via language grounding, SemAug, starts by identifying semantically appropriate new objects to place into a scene; a toy sketch of this selection step follows this entry.
arXiv Detail & Related papers (2022-08-15T19:00:56Z)
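A toy sketch of the selection step as we read it: pick a new object category whose word embedding is close to the objects already in the scene. `load_word_embeddings` is a hypothetical helper (e.g., loading GloVe vectors); the paper's exact grounding procedure may differ.

```python
# Sketch: choose a contextually plausible object to paste into a scene by
# word-embedding similarity. Embedding loading is a hypothetical helper.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_new_object(scene_labels, candidates, emb):
    """Choose the candidate most semantically compatible with the scene."""
    context = np.mean([emb[l] for l in scene_labels], axis=0)
    return max(candidates, key=lambda c: cosine(emb[c], context))

# Usage, assuming emb maps category names to vectors:
# emb = load_word_embeddings()  # hypothetical loader, e.g. GloVe
# pick_new_object(["car", "road"], ["truck", "sofa"], emb)
# -> expected to prefer "truck" over "sofa" in a street scene
```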
- Scaling Novel Object Detection with Weakly Supervised Detection Transformers [21.219817483091166]
We propose the Weakly Supervised Detection Transformer, which enables efficient knowledge transfer from a large-scale pretraining dataset to weakly supervised object detection (WSOD) finetuning.
Our experiments show that our approach outperforms previous state-of-the-art models on large-scale novel object detection datasets.
arXiv Detail & Related papers (2022-07-11T21:45:54Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture; a minimal sketch of the matching step follows this entry.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
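The core of DETR's set-based loss is a bipartite matching that assigns each ground-truth box to exactly one prediction. The sketch below shows that step in simplified form: the real DETR cost also includes a generalized-IoU term, omitted here.

```python
# Minimal sketch of DETR-style set matching: solve a bipartite assignment
# over a cost mixing class probability and L1 box distance. Simplified
# relative to the paper (no generalized-IoU term).
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    """pred_logits: (Q, C), pred_boxes: (Q, 4), tgt_labels: (T,), tgt_boxes: (T, 4)."""
    prob = pred_logits.softmax(-1)                       # (Q, C)
    cost_class = -prob[:, tgt_labels]                    # (Q, T): high prob -> low cost
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # (Q, T): L1 distance
    cost = cost_bbox + cost_class
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return rows, cols  # prediction rows[i] is matched to ground truth cols[i]

# Predictions left unmatched are supervised toward a "no object" class; this
# is what makes the loss set-based and removes the need for NMS.
```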
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.