CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1
Accuracy with ViT-B and ViT-L on ImageNet
- URL: http://arxiv.org/abs/2212.06138v1
- Date: Mon, 12 Dec 2022 18:59:59 GMT
- Authors: Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and
Shuyang Gu and Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and
Nenghai Yu
- Abstract summary: We find that fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that CLIP has achieved remarkable success in
performing zero-shot inference while its fine-tuning performance is not
satisfactory. In this paper, we identify that fine-tuning performance is
significantly impacted by hyper-parameter choices. We examine various key
hyper-parameters and empirically evaluate their impact in fine-tuning CLIP for
classification tasks through a comprehensive study. We find that the
fine-tuning performance of CLIP is substantially underestimated. Equipped with
hyper-parameter refinement, we demonstrate CLIP itself is better or at least
competitive in fine-tuning compared with large-scale supervised pre-training
approaches or latest works that use CLIP as prediction targets in Masked Image
Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve
85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset. These
observations challenge the conventional conclusion that CLIP is not suitable
for fine-tuning, and motivate us to rethink recently proposed improvements
based on CLIP. We will release our code publicly at
https://github.com/LightDXY/FT-CLIP.
Related papers
- AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning [50.78033979438031]
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias.
Based on analysis of key components, this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification.
arXiv Detail & Related papers (2024-04-13T10:46:11Z)
- FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs [24.991684983495542]
This paper proposes FairerCLIP, a general approach for making zero-shot predictions of CLIP more fair and robust to spurious correlations.
We formulate the problem of jointly debiasing CLIP's image and text representations in reproducing kernel Hilbert spaces (RKHSs).
arXiv Detail & Related papers (2024-03-22T19:41:26Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [12.872308743619403]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that their performance in the worst categories is significantly inferior to the overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z)
- Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution.
arXiv Detail & Related papers (2023-09-28T17:59:56Z)
- CLIP-KD: An Empirical Study of CLIP Model Distillation [24.52910358842176]
This paper aims to distill small CLIP models supervised by a large teacher CLIP model.
We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well.
Interactive contrastive learning across teacher and student encoders also improves performance.
arXiv Detail & Related papers (2023-07-24T12:24:07Z)
- EVA-CLIP: Improved Training Techniques for CLIP at Scale [20.145062325090286]
We propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training.
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance.
arXiv Detail & Related papers (2023-03-27T17:02:21Z)
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [85.37552507367175]
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
arXiv Detail & Related papers (2022-01-15T01:54:01Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)