Zero-Shot Visual Classification with Guided Cropping
- URL: http://arxiv.org/abs/2309.06581v1
- Date: Tue, 12 Sep 2023 20:09:12 GMT
- Title: Zero-Shot Visual Classification with Guided Cropping
- Authors: Piyapat Saranrittichai, Mauricio Munoz, Volker Fischer and Chaithanya Kumar Mummadi
- Abstract summary: We propose using an off-the-shelf zero-shot object detection model in a preprocessing step to focus the zero-shot classifier on the object of interest.
We empirically show that our approach improves zero-shot classification results across architectures and datasets, particularly for small objects.
- Score: 9.321383320998262
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pretrained vision-language models, such as CLIP, show promising zero-shot
performance across a wide variety of datasets. For closed-set classification
tasks, however, there is an inherent limitation: CLIP image encoders are
typically designed to extract generic image-level features that summarize
superfluous or confounding information for the target tasks. This results in
degradation of classification performance, especially when objects of interest
cover small areas of input images. In this work, we propose CLIP with Guided
Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection
model in a preprocessing step to increase the focus of the zero-shot classifier
on the object of interest and minimize the influence of extraneous image
regions. We empirically show that our approach improves zero-shot
classification results across architectures and datasets, particularly for
small objects.
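The guided-cropping pipeline is simple to prototype. Below is a minimal sketch of the idea, assuming OWL-ViT as the off-the-shelf zero-shot detector and an OpenAI CLIP checkpoint, both via Hugging Face transformers; the label set, score threshold, and single-box cropping policy are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of guided cropping (illustrative, not the authors' exact
# pipeline): a zero-shot detector proposes a box for the object of interest,
# the image is cropped to that box, and CLIP classifies the crop.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          OwlViTForObjectDetection, OwlViTProcessor)

class_names = ["dog", "cat", "bird"]  # hypothetical label set
image = Image.open("input.jpg").convert("RGB")

# 1) Zero-shot detection: query OWL-ViT with the class names themselves.
det_proc = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
det_inputs = det_proc(text=[class_names], images=image, return_tensors="pt")
with torch.no_grad():
    det_out = detector(**det_inputs)
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
result = det_proc.post_process_object_detection(
    det_out, threshold=0.1, target_sizes=target_sizes)[0]

# 2) Guided cropping: keep the highest-scoring box; fall back to the
#    full image if the detector finds nothing above the threshold.
if len(result["scores"]) > 0:
    box = result["boxes"][result["scores"].argmax()].tolist()
    crop = image.crop(tuple(int(v) for v in box))
else:
    crop = image

# 3) Zero-shot classification of the crop with CLIP.
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
prompts = [f"a photo of a {c}" for c in class_names]
clip_inputs = clip_proc(text=prompts, images=crop,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    logits = clip(**clip_inputs).logits_per_image  # shape (1, num_classes)
print(class_names[logits.argmax(dim=-1).item()])
```

Falling back to the full image when no box clears the threshold keeps the sketch no worse than plain CLIP when detection fails; cropping matters most when the object covers only a small area of the image, the regime where the abstract reports the clearest gains.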
Related papers
- TROPE: TRaining-Free Object-Part Enhancement for Seamlessly Improving Fine-Grained Zero-Shot Image Captioning [30.506968671472517]
We introduce TRaining-Free Object-Part Enhancement (TROPE).
TROPE enriches a base caption with additional object-part details using object detector proposals and Natural Language Processing techniques.
Our evaluations show that TROPE consistently boosts performance across all tested zero-shot IC approaches and achieves state-of-the-art results on fine-grained IC datasets.
arXiv Detail & Related papers (2024-09-30T05:24:01Z)
- Re-Scoring Using Image-Language Similarity for Few-Shot Object Detection [4.0208298639821525]
Few-shot object detection, which focuses on detecting novel objects with few labels, is an emerging challenge in the community.
Recent studies show that adapting a pre-trained model or modifying the loss function can improve performance.
We propose Re-scoring using Image-language Similarity for Few-shot object detection (RISF), which extends Faster R-CNN.
arXiv Detail & Related papers (2023-11-01T04:04:34Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We apply Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- A Low-Shot Object Counting Network With Iterative Prototype Adaptation [14.650207945870598]
We consider low-shot counting of arbitrary semantic categories in the image using only a few annotated exemplars (few-shot) or no exemplars (no-shot).
Existing methods extract queries by feature pooling, which neglects shape information (e.g., size and aspect ratio) and leads to reduced object localization accuracy and degraded count estimates.
We propose a Low-shot Object Counting network with iterative prototype Adaptation (LOCA).
arXiv Detail & Related papers (2022-11-15T15:39:23Z)
- Injecting Image Details into CLIP's Feature Space [29.450159407113155]
We introduce an efficient framework that can produce a single feature representation for a high-resolution image.
In the framework, we train a feature-fusing model based on CLIP features extracted via a carefully designed image-patching method.
We validate our framework by retrieving images from class-prompted queries on real-world and synthetic datasets.
arXiv Detail & Related papers (2022-08-31T06:18:10Z)
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Rectifying the Shortcut Learning of Background: Shared Object Concentration for Few-Shot Image Recognition [101.59989523028264]
Few-shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks.
We propose COSOC, a novel few-shot learning framework, to automatically identify foreground objects at both the pretraining and evaluation stages.
arXiv Detail & Related papers (2021-07-16T07:46:41Z)
- Improving Few-shot Learning with Weakly-supervised Object Localization [24.3569501375842]
We propose a novel framework that generates class representations by extracting features from class-relevant regions of the images.
Our method outperforms the baseline few-shot model on the miniImageNet and tieredImageNet benchmarks.
arXiv Detail & Related papers (2021-05-25T07:39:32Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- A Few-Shot Sequential Approach for Object Counting [63.82757025821265]
We introduce a class attention mechanism that sequentially attends to objects in the image and extracts their relevant features.
The proposed technique is trained on point-level annotations and uses a novel loss function that disentangles class-dependent and class-agnostic aspects of the model.
We present our results on a variety of object-counting/detection datasets, including FSOD and MS COCO.
arXiv Detail & Related papers (2020-07-03T18:23:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.