GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning
- URL: http://arxiv.org/abs/2303.09252v1
- Date: Thu, 16 Mar 2023 12:06:02 GMT
- Title: GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning
- Authors: Jiayi Lin, Shaogang Gong
- Abstract summary: One-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning.
Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
- Score: 55.77244064907146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A vision-language foundation model pretrained on very large-scale image-text
paired data has the potential to provide generalizable knowledge representation
for downstream visual recognition and detection tasks, especially on
supplementing the undersampled categories in downstream model training. Recent
studies utilizing CLIP for object detection have shown that a two-stage
detector design typically outperforms a one-stage detector, while requiring
more expensive training resources and longer inference time. In this work, we
propose a one-stage detector, GridCLIP, that narrows the performance gap to
two-stage detectors while being approximately 43 and 5 times faster than its
two-stage counterpart (ViLD) in training and testing respectively.
GridCLIP learns grid-level representations to adapt to the intrinsic principle
of one-stage detection learning by expanding the conventional CLIP image-text
holistic mapping to a more fine-grained, grid-text alignment. This differs from
the region-text mapping in two-stage detectors that apply CLIP directly by
treating regions as images. Specifically, GridCLIP performs Grid-level
Alignment to adapt the CLIP image-level representations to grid-level
representations by aligning them to CLIP category representations, so as to
learn the annotated (especially frequent) categories. To learn generalizable visual
representations of broader categories, especially undersampled ones, we perform
Image-level Alignment during training to propagate broad pre-learned categories
in the CLIP image encoder from the image-level to the grid-level
representations. Experiments show that the learned CLIP-based grid-level
representations boost the performance of undersampled (infrequent and novel)
categories, reaching detection performance comparable to that of two-stage detectors on the LVIS benchmark.
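The two alignment objectives described in the abstract can be read as simple losses computed on a one-stage detector's feature map. Below is a minimal PyTorch-style sketch under stated assumptions: the module names (`GridLevelAlignment`, `ImageLevelAlignment`, `grid_proj`), tensor shapes, and loss choices (cross-entropy for grid-text alignment, L1 for image-level distillation from a frozen CLIP image encoder) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the two alignment losses, assuming a frozen CLIP text/image encoder
# and a one-stage detector feature map. Names and loss choices are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GridLevelAlignment(nn.Module):
    """Align per-grid detector features to frozen CLIP text (category) embeddings."""

    def __init__(self, feat_dim: int, clip_dim: int, temperature: float = 0.01):
        super().__init__()
        # 1x1 conv projects detector features into the CLIP embedding space.
        self.grid_proj = nn.Conv2d(feat_dim, clip_dim, kernel_size=1)
        self.temperature = temperature

    def forward(self, fpn_feat, text_embeds, grid_labels):
        # fpn_feat:    (B, C, H, W) feature map from the one-stage detector
        # text_embeds: (K, D) frozen CLIP text embeddings, L2-normalized
        # grid_labels: (B, H, W) category index per grid cell (-1 = unlabeled)
        grid_embeds = F.normalize(self.grid_proj(fpn_feat), dim=1)       # (B, D, H, W)
        logits = torch.einsum("bdhw,kd->bkhw", grid_embeds, text_embeds) / self.temperature
        loss = F.cross_entropy(logits, grid_labels.clamp(min=0), reduction="none")
        mask = (grid_labels >= 0).float()                                # ignore unlabeled cells
        return (loss * mask).sum() / mask.sum().clamp(min=1), grid_embeds


class ImageLevelAlignment(nn.Module):
    """Distil the frozen CLIP image embedding into the pooled grid representation."""

    def forward(self, grid_embeds, clip_image_embed):
        # grid_embeds:      (B, D, H, W) projected, normalized grid features
        # clip_image_embed: (B, D) embedding of the same image from the CLIP image encoder
        pooled = F.normalize(grid_embeds.mean(dim=(2, 3)), dim=-1)
        target = F.normalize(clip_image_embed, dim=-1)
        return F.l1_loss(pooled, target)
```

In such a setup, these two losses would be added (with weighting factors) to the usual one-stage detection losses during training, e.g. `total = det_loss + w1 * grid_align_loss + w2 * image_align_loss`; the weighting and the exact assignment of category labels to grid cells are further assumptions not specified by the abstract.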
Related papers
- C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder.
Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z)
- CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation [44.450243388665776]
We propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation.
Our CLIP-VIS adopts frozen CLIP and introduces three modules, including class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification.
arXiv Detail & Related papers (2024-03-19T05:27:04Z)
- Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation [20.880942041889444]
We propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel.
Specifically, we introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers.
We demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes.
arXiv Detail & Related papers (2023-12-20T04:27:13Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image and text features, integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification [23.392746466420128]
This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification.
We take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation.
Our method outperforms state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets.
arXiv Detail & Related papers (2023-07-31T13:12:02Z)
- S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.