CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1
Accuracy with ViT-B and ViT-L on ImageNet
- URL: http://arxiv.org/abs/2212.06138v1
- Date: Mon, 12 Dec 2022 18:59:59 GMT
- Title: CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1
Accuracy with ViT-B and ViT-L on ImageNet
- Authors: Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and
Shuyang Gu and Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and
Nenghai Yu
- Abstract summary: We find that the fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
- Score: 139.56863124214905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that CLIP has achieved remarkable success in
performing zero-shot inference while its fine-tuning performance is not
satisfactory. In this paper, we identify that fine-tuning performance is
significantly impacted by hyper-parameter choices. We examine various key
hyper-parameters and empirically evaluate their impact in fine-tuning CLIP for
classification tasks through a comprehensive study. We find that the
fine-tuning performance of CLIP is substantially underestimated. Equipped with
hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least
competitive with, large-scale supervised pre-training approaches and the latest works
that use CLIP as the prediction target in Masked Image Modeling. Specifically,
CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning
Top-1 accuracy, respectively, on the ImageNet-1K dataset. These
observations challenge the conventional conclusion that CLIP is not suitable
for fine-tuning, and motivate us to rethink recently proposed improvements
based on CLIP. We will release our code publicly at
https://github.com/LightDXY/FT-CLIP.
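The abstract's claim is that careful hyper-parameter choices, not architectural changes, unlock CLIP's fine-tuning accuracy; the authors' full recipe is in the repository linked above. As a rough illustration only, the sketch below fine-tunes the CLIP ViT-B/16 image encoder end-to-end with a linear classification head on ImageNet-1K using PyTorch and the open_clip package; the dataset path and every hyper-parameter value are placeholder assumptions, not the settings reported in the paper.

```python
# Minimal sketch (not the authors' exact recipe): full fine-tuning of the CLIP
# image encoder with a linear head on ImageNet-1K. Assumes the open_clip and
# torchvision packages; lr, weight decay, and label smoothing below are
# illustrative placeholders.
import torch
import torch.nn as nn
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained CLIP ViT-B/16 and its evaluation preprocessing.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai"
)
visual = model.visual.to(device)  # image tower only; the text tower is unused here

# Infer the feature dimension with a dummy forward pass.
with torch.no_grad():
    feat_dim = visual(torch.zeros(1, 3, 224, 224, device=device)).shape[-1]
head = nn.Linear(feat_dim, 1000).to(device)  # 1000 ImageNet-1K classes

# ImageNet-1K in ImageFolder layout (path is a placeholder).
train_set = ImageFolder("/path/to/imagenet/train", transform=preprocess)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

# AdamW over backbone + head. The paper's point is that choices like the
# learning rate matter a great deal, so treat these values as tunable.
optimizer = torch.optim.AdamW(
    list(visual.parameters()) + list(head.parameters()),
    lr=1e-5, weight_decay=0.05,
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

visual.train()
for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    loss = criterion(head(visual(images)), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Since the abstract attributes the accuracy gap entirely to hyper-parameter refinement, the interesting part is the tuning (learning rate, schedule, regularization) rather than the code structure, which remains ordinary supervised fine-tuning.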
Related papers
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer language representations.
SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
- TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models [53.91006249339802]
We propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks.
TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP.
We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets.
arXiv Detail & Related papers (2024-11-20T08:58:59Z)
- FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs [24.991684983495542]
This paper proposes FairerCLIP, a general approach for making zero-shot predictions of CLIP more fair and robust to spurious correlations.
We formulate the problem of jointly debiasing CLIP's image and text representations in reproducing kernel Hilbert spaces (RKHSs).
arXiv Detail & Related papers (2024-03-22T19:41:26Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of the CLIP vision encoder with our CSA module.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that CLIP's performance in the worst-performing categories is significantly inferior to its overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z)
- Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a subset balanced over the metadata distribution (a toy sketch of this balancing idea follows after this list).
arXiv Detail & Related papers (2023-09-28T17:59:56Z)
- CLIP-KD: An Empirical Study of CLIP Model Distillation [24.52910358842176]
This paper aims to distill small CLIP models supervised by a large teacher CLIP model.
We show that simple feature mimicry with a Mean Squared Error loss works surprisingly well (a toy sketch of such a mimicry loss follows after this list).
Interactive contrastive learning across the teacher and student encoders is also effective in improving performance.
arXiv Detail & Related papers (2023-07-24T12:24:07Z)
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [85.37552507367175]
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
arXiv Detail & Related papers (2022-01-15T01:54:01Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP) to alleviate this limitation.
We demonstrate that, by carefully utilizing the widespread supervision among the image-text pairs, DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
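The Demystifying CLIP Data entry above summarizes MetaCLIP as taking a raw image-text pool plus metadata (concepts) and returning a subset balanced over the metadata distribution. The toy sketch below shows one way such balancing could look; the naive substring matching, the per-entry cap, and all names are assumptions made for illustration and not MetaCLIP's actual curation pipeline.

```python
# Toy sketch of metadata-balanced sub-sampling: match each caption against a
# list of metadata entries (concepts), then cap how many pairs any single
# entry may contribute, so frequent concepts no longer dominate the subset.
# The matching rule and cap are illustrative assumptions.
import random
from collections import defaultdict

def balanced_subset(pairs, metadata, cap_per_entry=3, seed=0):
    """pairs: list of (image_id, caption); metadata: list of concept strings."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for image_id, caption in pairs:
        text = caption.lower()
        for entry in metadata:
            if entry in text:                     # naive substring match
                buckets[entry].append((image_id, caption))
    subset = []
    for entry, matched in buckets.items():
        rng.shuffle(matched)
        subset.extend(matched[:cap_per_entry])    # head entries get capped
    return subset

# Example: "cat" captions are over-represented; the cap keeps only 3 of them.
pairs = [(i, "a photo of a cat") for i in range(10)] + [(99, "a red bus")]
print(len(balanced_subset(pairs, ["cat", "bus"])))  # -> 4
```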
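The CLIP-KD entry reports that simple feature mimicry with a Mean Squared Error loss distills a small student CLIP from a large teacher surprisingly well. The sketch below shows what such a mimicry loss could look like; the linear projection (to bridge mismatched feature sizes) and the random tensors standing in for encoder outputs are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of feature-mimicry distillation: student embeddings are
# regressed onto frozen teacher embeddings with an MSE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_mimicry_loss(student_feats, teacher_feats, proj):
    """MSE between projected student features and detached teacher targets."""
    return F.mse_loss(proj(student_feats), teacher_feats.detach())

# Toy usage: random tensors stand in for student/teacher CLIP encoder outputs.
batch, d_student, d_teacher = 8, 512, 768
proj = nn.Linear(d_student, d_teacher)             # assumed dimension bridge
student = torch.randn(batch, d_student, requires_grad=True)
teacher = torch.randn(batch, d_teacher)            # would come from a frozen teacher
loss = feature_mimicry_loss(student, teacher, proj)
loss.backward()                                    # gradients reach student + proj
print(float(loss))
```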
This list is automatically generated from the titles and abstracts of the papers on this site.