Image-Caption Encoding for Improving Zero-Shot Generalization
- URL: http://arxiv.org/abs/2402.02662v1
- Date: Mon, 5 Feb 2024 01:14:07 GMT
- Title: Image-Caption Encoding for Improving Zero-Shot Generalization
- Authors: Eric Yang Yu, Christopher Liao, Sathvik Ravi, Theodoros Tsiligkaridis,
Brian Kulis
- Abstract summary: We show that when an OOD data point is misclassified, the correct class can typically be found in the Top-K predicted classes.
To steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method.
Our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and up to 3% on challenging datasets.
- Score: 12.906307770270026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models have combined contrastive
approaches with generative methods to achieve state-of-the-art (SOTA) on
downstream inference tasks like zero-shot image classification. However, a
persistent issue of these models for image classification is their limited
out-of-distribution (OOD) generalization. We first show that when
an OOD data point is misclassified, the correct class can typically be found in
the Top-K predicted classes. In order to steer the model prediction toward the
correct class within the top predicted classes, we propose the Image-Caption
Encoding (ICE) method, a straightforward approach that directly enforces
consistency between the image-conditioned and caption-conditioned predictions
at evaluation time only. Intuitively, we take advantage of unique properties of
the generated captions to guide our local search for the correct class label
within the Top-K predicted classes. We show that our method can be easily
combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on
average and up to 3% on challenging datasets. Our code:
https://github.com/Chris210634/ice
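For intuition, here is a minimal sketch of an ICE-style evaluation step. It is an illustration under assumptions, not the authors' implementation (see the repository above for that): it assumes a CLIP-like model exposing `encode_image`/`encode_text`, a separate caption generator (`captioner`), class-prompt strings (`class_prompts`), and a hypothetical weighting knob `lam` that is not a value from the paper.

```python
# Sketch of ICE-style re-ranking: mix image-conditioned and
# caption-conditioned scores over the Top-K predicted classes.
import torch

def ice_predict(image, clip_model, captioner, class_prompts, k=5, lam=0.5):
    # Image-conditioned scores: cosine similarity between the image
    # embedding and each class-prompt embedding (standard zero-shot CLIP).
    img_emb = clip_model.encode_image(image)            # shape (d,)
    txt_emb = clip_model.encode_text(class_prompts)     # shape (C, d)
    img_emb = img_emb / img_emb.norm()
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    img_scores = txt_emb @ img_emb                      # shape (C,)

    # Restrict the search to the Top-K classes, where the correct label
    # usually lies even when the Top-1 prediction is wrong.
    topk = img_scores.topk(k).indices

    # Caption-conditioned scores: embed a generated caption of the image
    # and compare it against the same class prompts.
    caption = captioner.generate(image)                 # e.g. "a photo of ..."
    cap_emb = clip_model.encode_text([caption])[0]
    cap_emb = cap_emb / cap_emb.norm()
    cap_scores = txt_emb @ cap_emb                      # shape (C,)

    # Enforce consistency between the two views at evaluation time;
    # `lam` is an assumed mixing weight, not taken from the paper.
    mixed = img_scores[topk] + lam * cap_scores[topk]
    return topk[mixed.argmax()].item()
```

The key step is the last one: the caption-conditioned scores re-rank only the Top-K candidates, so the consistency term acts as a local search around the base zero-shot prediction rather than replacing it.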
Related papers
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification [1.7265013728931]
This paper introduces a novel framework for zero-shot learning (ZSL) to recognize new categories that are unseen during training.
We propose three strategies to enhance the model's performance on ZSL.
arXiv Detail & Related papers (2024-05-03T15:02:41Z)
- Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and a proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency (see the sketch after this list).
arXiv Detail & Related papers (2023-05-11T17:58:17Z)
- Enhancing Self-Supervised Learning for Remote Sensing with Elevation Data: A Case Study with Scarce and High Level Semantic Labels [1.534667887016089]
This work proposes a hybrid unsupervised and supervised learning method to pre-train models for Earth observation downstream tasks.
We combine a contrastive pre-training approach with a pixel-wise regression pretext task that predicts coarse elevation maps.
arXiv Detail & Related papers (2023-04-13T23:01:11Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate their zero-shot failure cases and find that many are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime.
arXiv Detail & Related papers (2020-05-25T18:12:33Z)
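As referenced in the Token-Level Confidence entry above, here is a minimal sketch of TLC-style scoring. The `score_tokens` helper and the mean-log-probability aggregator are illustrative assumptions, not details confirmed by that paper.

```python
# Sketch of token-level confidence (TLC-style) image-caption scoring,
# assuming a captioning model that can force-decode a proposed caption
# and return per-token log-probabilities.
import math

def tlc_score(model, image, caption_tokens):
    # Force-decode the proposed caption conditioned on the image and
    # collect the model's log-probability for each caption token
    # (`score_tokens` is a hypothetical helper on the fine-tuned model).
    token_logprobs = model.score_tokens(image, caption_tokens)  # list[float]

    # Aggregate token confidences into a single consistency estimate;
    # the mean log-probability is one simple choice of aggregator.
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```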