Context-Aware Robust Fine-Tuning
- URL: http://arxiv.org/abs/2211.16175v1
- Date: Tue, 29 Nov 2022 13:07:41 GMT
- Title: Context-Aware Robust Fine-Tuning
- Authors: Xiaofeng Mao, Yuefeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, Zhao Li
- Abstract summary: Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]".
Fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks.
We propose Context-Aware Robust Fine-tuning (CAR-FT) to solve this problem.
- Score: 23.027441849817922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability
to classify an image as belonging to "[CLASS]" by using the similarity between the
image and the prompt sentence "a [CONTEXT] of [CLASS]". Based on exhaustive
text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g.
background, style, and viewpoint, and exhibits unprecedented robustness against a
wide range of distribution shifts. However, recent works find that further
fine-tuning of CLIP models improves accuracy but sacrifices robustness on
downstream tasks. We conduct an empirical investigation showing that fine-tuning
corrupts the context-aware ability of pre-trained CLIP features. To solve
this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT
regularizes the model during fine-tuning to capture the context information.
Specifically, we use the zero-shot prompt weights to obtain the context distribution
of an image. By minimizing the Kullback-Leibler Divergence (KLD)
between the context distributions induced by the original and fine-tuned CLIP models,
CAR-FT enables the context-aware ability of CLIP to be inherited by downstream tasks,
and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD)
accuracy. The experimental results show CAR-FT achieves superior robustness on
five OOD test datasets of ImageNet, while also bringing accuracy gains on nine
downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization
(DG) methods, reaching 78.5% average accuracy on the DomainBed benchmark and
setting a new state of the art.
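To make the objective concrete, below is a minimal sketch of how the regularizer described in the abstract could be implemented. It is not the authors' released code; the prompt templates, temperature, loss weight, KL direction, and the CLIP-style `text_encoder`/`tokenizer` interfaces are illustrative assumptions.

```python
# Minimal sketch of the CAR-FT objective described above (not the authors' code).
# Prompt templates, temperature, loss weight, and KL direction are assumptions.
import torch
import torch.nn.functional as F


def build_context_weights(text_encoder, tokenizer, classnames, contexts, device="cpu"):
    """One embedding per context, obtained by averaging text features over classes."""
    weights = []
    with torch.no_grad():
        for ctx in contexts:                                  # e.g. "a sketch of a {}."
            tokens = tokenizer([ctx.format(c) for c in classnames]).to(device)
            emb = F.normalize(text_encoder(tokens), dim=-1)   # (num_classes, dim)
            emb = F.normalize(emb.mean(dim=0), dim=-1)        # average out the class
            weights.append(emb)
    return torch.stack(weights)                               # (num_contexts, dim)


def context_logits(image_features, context_weights, tau=0.01):
    """Image-to-context similarities; a softmax over them gives the context distribution."""
    image_features = F.normalize(image_features, dim=-1)
    return image_features @ context_weights.t() / tau


def car_ft_loss(class_logits, labels, ft_image_feat, zs_image_feat,
                context_weights, lam=1.0, tau=0.01):
    """Downstream cross-entropy plus a KLD term that keeps the fine-tuned
    model's context distribution close to the frozen zero-shot one."""
    ce = F.cross_entropy(class_logits, labels)
    p_zs = F.softmax(context_logits(zs_image_feat, context_weights, tau), dim=-1)
    log_p_ft = F.log_softmax(context_logits(ft_image_feat, context_weights, tau), dim=-1)
    kld = F.kl_div(log_p_ft, p_zs, reduction="batchmean")     # KL(p_zs || p_ft)
    return ce + lam * kld
```

In this sketch, `zs_image_feat` would come from the frozen zero-shot CLIP image encoder and `ft_image_feat` from the encoder being fine-tuned; only the KLD term ties the two together.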
Related papers
- Transductive Zero-Shot and Few-Shot CLIP [24.592841797020203]
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently.
Our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
arXiv Detail & Related papers (2024-04-08T12:44:31Z)
- Lipsum-FT: Robust Fine-Tuning of Zero-Shot Models Using Random Text Guidance [27.91782770050068]
Large-scale contrastive vision-language pre-trained models provide zero-shot models that achieve competitive performance across a range of image classification tasks without training on downstream data.
Recent works have confirmed that additional fine-tuning of the zero-shot model on the reference data results in enhanced downstream performance, but compromises the model's robustness against distribution shifts.
We propose a novel robust fine-tuning algorithm, Lipsum-FT, that effectively utilizes the language modeling aspect of the vision-language pre-trained models.
arXiv Detail & Related papers (2024-04-01T02:01:33Z)
- Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
Multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results.
It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet.
We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z)
- Domain Aligned CLIP for Few-shot Classification [3.5326413171911555]
Domain Aligned CLIP (DAC) improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model.
We study the effectiveness of DAC by benchmarking it on 11 widely used image classification tasks, observing consistent improvements of about 2.3% over strong baselines in 16-shot classification.
arXiv Detail & Related papers (2023-11-15T18:34:26Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that the fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes the learned representations to be geometrically consistent across the image and text spaces (see the sketch after this list).
arXiv Detail & Related papers (2022-05-28T15:31:17Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
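As referenced in the CyCLIP entry above, the following is a rough sketch of cyclic-consistency regularizers in the spirit of that paper, added on top of the standard CLIP contrastive loss; the coefficient names and values are placeholders, not the paper's exact settings.

```python
# Rough sketch of CyCLIP-style cyclic-consistency regularizers (assumed form,
# not the paper's exact implementation), added to the usual CLIP contrastive loss.
import torch
import torch.nn.functional as F


def cyclic_consistency(image_emb, text_emb, lam_cross=0.25, lam_in=0.25):
    """image_emb, text_emb: (batch, dim) embeddings of paired images and captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim_it = image_emb @ text_emb.t()     # sim(I_j, T_k)
    sim_ii = image_emb @ image_emb.t()    # sim(I_j, I_k)
    sim_tt = text_emb @ text_emb.t()      # sim(T_j, T_k)
    # Cross-modal consistency: sim(I_j, T_k) should match sim(I_k, T_j).
    cross = ((sim_it - sim_it.t()) ** 2).mean()
    # In-modal consistency: image-image geometry should mirror text-text geometry.
    in_modal = ((sim_ii - sim_tt) ** 2).mean()
    return lam_cross * cross + lam_in * in_modal
```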