F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
- URL: http://arxiv.org/abs/2209.15639v1
- Date: Fri, 30 Sep 2022 17:59:52 GMT
- Title: F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
- Authors: Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
- Abstract summary: We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining.
- Score: 54.21757555804668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present F-VLM, a simple open-vocabulary object detection method built upon
Frozen Vision and Language Models. F-VLM simplifies the current multi-stage
training pipeline by eliminating the need for knowledge distillation or
detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1)
retains the locality-sensitive features necessary for detection, and 2) is a
strong region classifier. We finetune only the detector head and combine the
detector and VLM outputs for each region at inference time. F-VLM shows
compelling scaling behavior and achieves +6.5 mask AP improvement over the
previous state of the art on novel categories of the LVIS open-vocabulary
detection benchmark. In addition, we demonstrate very competitive results on
the COCO open-vocabulary detection benchmark and cross-dataset transfer
detection, along with significant training speed-up and compute savings. Code
will be
released.
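To make the score-combination step concrete: at inference time, F-VLM pools frozen-VLM features for each detected region, scores them against category text embeddings, and fuses the result with the detector head's score; in the full paper this fusion is a geometric mean with separate weights for base and novel categories. The sketch below illustrates that combination; the function name, placeholder weights, and toy inputs are illustrative assumptions, not the released implementation.

```python
import numpy as np

def fuse_region_scores(det_scores, vlm_scores, is_base, alpha=0.35, beta=0.65):
    """Geometric-mean fusion of detector and frozen-VLM region scores (illustrative).

    det_scores : (R, C) softmax scores from the finetuned detector head.
    vlm_scores : (R, C) softmax of cosine similarities between pooled
                 frozen-VLM region features and category text embeddings.
    is_base    : (C,) boolean mask, True for base (seen) categories.
    alpha/beta : VLM weight for base / novel categories (placeholder values).
    """
    fused_base = det_scores ** (1 - alpha) * vlm_scores ** alpha
    fused_novel = det_scores ** (1 - beta) * vlm_scores ** beta
    # Rely more on the frozen VLM for novel categories, more on the detector for base ones.
    return np.where(is_base[None, :], fused_base, fused_novel)

# Toy usage: 2 regions, 3 categories (the first two are base, the last is novel).
det = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
vlm = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.2, 0.6]])
base_mask = np.array([True, True, False])
print(fuse_region_scores(det, vlm, base_mask))
```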
Related papers
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased predictions in the open-vocabulary detection context.
Our observations lead to a simple yet effective paradigm, named MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, which makes them costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM [81.75988648572347]
We present DetToolChain, a novel prompting paradigm to unleash the zero-shot object detection ability of multimodal large language models (MLLMs).
Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts.
We show that GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on the MS COCO Novel class set for open-vocabulary detection.
arXiv Detail & Related papers (2024-03-19T06:54:33Z)
- Zero-shot Nuclei Detection via Visual-Language Pre-trained Models [19.526504045149895]
Large-scale visual-language pre-trained models (VLPMs) have proven their excellent performance in downstream object detection for natural scenes.
However, zero-shot nuclei detection on H&E images via VLPMs remains underexplored.
In this paper, we explore the potential of the object-level VLPM, the Grounded Language-Image Pre-training (GLIP) model, for zero-shot nuclei detection.
arXiv Detail & Related papers (2023-06-30T13:44:13Z)
- CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [36.31910430275781]
We propose a framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching.
CORA achieves 41.7 AP50 on the COCO OVD benchmark, and 28.1 box APr on the LVIS OVD benchmark.
arXiv Detail & Related papers (2023-03-23T07:13:57Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded vision-language (VL) understanding model that serves both localization tasks and VL understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.