Improving Zero-shot Generalization and Robustness of Multi-modal Models
- URL: http://arxiv.org/abs/2212.01758v2
- Date: Thu, 25 May 2023 17:14:50 GMT
- Title: Improving Zero-shot Generalization and Robustness of Multi-modal Models
- Authors: Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang,
Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao
- Abstract summary: Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
- Score: 70.14692320804178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal image-text models such as CLIP and LiT have demonstrated
impressive performance on image classification benchmarks and their zero-shot
generalization ability is particularly exciting. While the top-5 zero-shot
accuracies of these models are very high, the top-1 accuracies are much lower
(over 25% gap in some cases). We investigate the reasons for this performance
gap and find that many of the failure cases are caused by ambiguity in the text
prompts. First, we develop a simple and efficient zero-shot post-hoc method to
identify images whose top-1 prediction is likely to be incorrect, by measuring
consistency of the predictions w.r.t. multiple prompts and image
transformations. We show that our procedure better predicts mistakes,
outperforming the popular max logit baseline on selective prediction tasks.
Next, we propose a simple and efficient way to improve accuracy on such
uncertain images by making use of the WordNet hierarchy; specifically we
augment the original class by incorporating its parent and children from the
semantic label hierarchy, and plug the augmentation into text prompts. We
conduct experiments on both CLIP and LiT models with five different
ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by
17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set.
We also show that our method improves across ImageNet shifted datasets, four
other datasets, and other model architectures such as LiT. The proposed method
is hyperparameter-free, requires no additional model training and can be easily
scaled to other large multi-modal architectures. Code is available at
https://github.com/gyhandy/Hierarchy-CLIP.
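The abstract describes two concrete steps: flagging likely top-1 mistakes by checking prediction consistency across multiple prompts and image transformations, and then augmenting the class name of uncertain images with its WordNet parent and children. Below is a minimal sketch of that idea, not the authors' released code (see the GitHub link above for that); it assumes OpenAI's `clip` package and NLTK's WordNet corpus, and the prompt templates, number of hyponyms, and the 0.8 agreement threshold are illustrative placeholders rather than the paper's exact choices.

```python
# Hedged sketch, assuming OpenAI's `clip` package
# (pip install git+https://github.com/openai/CLIP) and NLTK WordNet
# (python -m nltk.downloader wordnet). Templates/thresholds are illustrative.
import torch
import clip
from nltk.corpus import wordnet as wn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A few prompt templates; consistency is measured across templates
# and across augmented views of the same image.
PROMPTS = ["a photo of a {}.", "a blurry photo of a {}.", "a close-up photo of a {}."]

def encode_texts(texts):
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

def encode_images(pil_views):
    batch = torch.stack([preprocess(v) for v in pil_views]).to(device)
    with torch.no_grad():
        emb = model.encode_image(batch)
    return emb / emb.norm(dim=-1, keepdim=True)

def flag_uncertain(pil_views, class_names, agreement=0.8):
    """Step 1: flag an image as likely misclassified when its top-1
    prediction disagrees across prompt templates and augmented views."""
    img = encode_images(pil_views)                                      # (V, D)
    votes = []
    for template in PROMPTS:
        txt = encode_texts([template.format(c) for c in class_names])  # (C, D)
        votes += (img @ txt.T).argmax(dim=-1).tolist()                  # one vote per view
    top = max(set(votes), key=votes.count)
    return votes.count(top) / len(votes) < agreement, class_names[top]

def hierarchy_augment(class_name):
    """Step 2: expand an ambiguous class name with its WordNet parent
    (hypernym) and a few children (hyponyms) before re-running the prompts."""
    synsets = wn.synsets(class_name.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return class_name
    s = synsets[0]
    parents = [p.lemma_names()[0].replace("_", " ") for p in s.hypernyms()[:1]]
    children = [c.lemma_names()[0].replace("_", " ") for c in s.hyponyms()[:3]]
    name = class_name
    if parents:
        name += f", which is a kind of {parents[0]}"
    if children:
        name += ", such as " + " or ".join(children)
    return name  # e.g. "retriever, which is a kind of sporting dog, such as ..."
```

For images the first step flags, the augmented name is plugged back into the same prompt templates and the prediction recomputed. Consistent with the abstract, a procedure along these lines is hyperparameter-free and requires no additional training, which is what allows it to scale to other architectures such as LiT.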
Related papers
- Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting [55.361337202198925]
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
We propose a label-free prompt distribution learning and bias correction framework, dubbed **Frolic**, which boosts zero-shot performance without the need for labeled data.
arXiv Detail & Related papers (2024-10-25T04:00:45Z) - What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models [11.683093317651517]
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification.
We present a simple yet effective approach for zero-shot image classification using multimodal LLMs.
Our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets.
arXiv Detail & Related papers (2024-05-24T16:05:15Z) - Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification [1.7265013728931]
This paper introduces a novel framework for zero-shot learning (ZSL) to recognize new categories that are unseen during training.
We propose three strategies to enhance the model's performance on ZSL.
arXiv Detail & Related papers (2024-05-03T15:02:41Z) - Transductive Zero-Shot and Few-Shot CLIP [24.592841797020203]
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently.
Our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
arXiv Detail & Related papers (2024-04-08T12:44:31Z) - Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z) - Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models [37.574691902971296]
We propose a novel image clustering pipeline that leverages the powerful feature representation of large pre-trained models.
We show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k.
arXiv Detail & Related papers (2023-06-08T15:20:27Z) - Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by training only new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero-, few-, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Core Risk Minimization using Salient ImageNet [53.616101711801484]
We introduce the Salient ImageNet dataset with more than 1 million soft masks localizing core and spurious features for all 1000 ImageNet classes.
Using this dataset, we first evaluate the reliance of several ImageNet pretrained models (42 total) on spurious features.
Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features.
arXiv Detail & Related papers (2022-03-28T01:53:34Z) - SimMIM: A Simple Framework for Masked Image Modeling [29.015777125540613]
This paper presents SimMIM, a simple framework for masked image modeling.
We study the major components of our framework and find that simple designs for each component yield very strong representation learning performance.
We also leverage this approach to facilitate the training of a 3B model that, using $40\times$ less data than in previous practice, achieves state-of-the-art results on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:45Z)