TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight
Inheritance
- URL: http://arxiv.org/abs/2309.12314v1
- Date: Thu, 21 Sep 2023 17:59:53 GMT
- Title: TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight
Inheritance
- Authors: Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan,
Hong Xuan, Michael Valenzuela, Xi (Stephen) Chen, Xinggang Wang, Hongyang
Chao, Han Hu
- Abstract summary: We propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
We show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance.
Our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet.
- Score: 97.01406871579525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel cross-modal distillation method, called
TinyCLIP, for large-scale language-image pre-trained models. The method
introduces two core techniques: affinity mimicking and weight inheritance.
Affinity mimicking explores the interaction between modalities during
distillation, enabling student models to mimic teachers' behavior of learning
cross-modal feature alignment in a visual-linguistic affinity space. Weight
inheritance transmits the pre-trained weights from the teacher models to their
student counterparts to improve distillation efficiency. Moreover, we extend
the method into a multi-stage progressive distillation to mitigate the loss of
informative weights during extreme compression. Comprehensive experiments
demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of
the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot
performance. While aiming for comparable performance, distillation with weight
inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to
training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M,
achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet,
surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9%
parameters. Finally, we demonstrate the good transferability of TinyCLIP in
various downstream tasks. Code and models will be open-sourced at
https://aka.ms/tinyclip.
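For readers who want a concrete picture of the two techniques named in the abstract, the sketch below shows one plausible PyTorch rendering of an affinity-mimicking loss (matching teacher and student image-text affinity distributions with a KL term) and a naive form of weight inheritance (copying a subset of teacher transformer blocks into a shallower student). The function names, temperature, symmetric KL formulation, and uniform block selection are illustrative assumptions rather than the paper's exact recipe.
```python
# Minimal sketch (not the official TinyCLIP code) of the two techniques
# named in the abstract: an affinity-mimicking loss and weight inheritance.
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(img_s, txt_s, img_t, txt_t, tau=0.07):
    """KL divergence between teacher and student image-text affinity
    distributions, averaged over both matching directions."""
    img_s, txt_s = F.normalize(img_s, dim=-1), F.normalize(txt_s, dim=-1)
    img_t, txt_t = F.normalize(img_t, dim=-1), F.normalize(txt_t, dim=-1)

    logits_s = img_s @ txt_s.t() / tau   # student affinities, image-to-text
    logits_t = img_t @ txt_t.t() / tau   # teacher affinities, image-to-text

    kl_i2t = F.kl_div(F.log_softmax(logits_s, dim=-1),
                      F.softmax(logits_t, dim=-1), reduction="batchmean")
    kl_t2i = F.kl_div(F.log_softmax(logits_s.t(), dim=-1),
                      F.softmax(logits_t.t(), dim=-1), reduction="batchmean")
    return 0.5 * (kl_i2t + kl_t2i)

def inherit_weights(teacher_blocks, student_blocks):
    """Initialize a shallower student by copying every k-th teacher block.
    The paper selects informative weights more carefully; uniform block
    copying here is only an illustrative placeholder."""
    stride = len(teacher_blocks) // len(student_blocks)
    for i, blk in enumerate(student_blocks):
        blk.load_state_dict(teacher_blocks[i * stride].state_dict())
```
In the multi-stage progressive setting described in the abstract, such a step would be repeated several times, shrinking the student moderately at each stage instead of compressing to the final size in one shot.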
Related papers
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training [17.158498267947877]
We introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance.
MobileCLIP uses knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models.
Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
arXiv Detail & Related papers (2023-11-28T18:55:42Z)
- CLIP-KD: An Empirical Study of CLIP Model Distillation [24.52910358842176]
This paper aims to distill small CLIP models supervised by a large teacher CLIP model.
We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well.
Interactive contrastive learning across teacher and student encoders is also effective for improving performance.
arXiv Detail & Related papers (2023-07-24T12:24:07Z)
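The CLIP-KD entry above reports that plain feature mimicry with an MSE loss is a surprisingly strong baseline. Below is a minimal sketch of such a loss; the linear projection used to reconcile student and teacher embedding widths is an assumption for illustration, not necessarily the paper's exact setup.
```python
# Hedged sketch of MSE feature mimicry between a student and a frozen teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimicry(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projects student features into the teacher's embedding space
        # when the two widths differ (illustrative choice).
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, feat_student: torch.Tensor, feat_teacher: torch.Tensor):
        return F.mse_loss(self.proj(feat_student), feat_teacher.detach())
```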
- Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations [130.05189514598996]
Masked Image Modeling (MIM) and Contrastive Learning (CL) demonstrate that self-supervision is powerful for learning good representations.
In this paper, we make the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions.
Inspired by these experimental observations, we find that MIM and CL are suited to lower and higher layers, respectively.
We propose a surprisingly simple "sequential cascade" approach: early layers are first trained under an MIM loss, on top of which later layers continue to be trained under a CL loss.
arXiv Detail & Related papers (2023-02-27T20:52:10Z)
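The Layer Grafted Pre-training entry above describes a two-stage "sequential cascade": lower layers trained with an MIM loss, upper layers then trained with a CL loss on top of them. The outline below sketches that schedule; the split index, learning rates, freezing of the lower layers, and the mim_loss_fn / cl_loss_fn callables are placeholders, not the paper's exact recipe.
```python
# Outline of a two-stage MIM-then-CL cascade; loss functions and loaders
# are user-supplied placeholders.
import torch

def layer_grafted_pretrain(blocks, mim_loss_fn, cl_loss_fn, loaders, split=6):
    lower, upper = blocks[:split], blocks[split:]

    # Stage 1: train the lower layers with a masked-image-modeling loss.
    opt1 = torch.optim.AdamW((p for b in lower for p in b.parameters()), lr=1e-4)
    for images in loaders["mim"]:
        loss = mim_loss_fn(lower, images)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the lower layers, train the upper layers with a
    # contrastive loss computed on top of them.
    for b in lower:
        for p in b.parameters():
            p.requires_grad_(False)
    opt2 = torch.optim.AdamW((p for b in upper for p in b.parameters()), lr=1e-4)
    for views in loaders["cl"]:
        loss = cl_loss_fn(lower, upper, views)
        opt2.zero_grad(); loss.backward(); opt2.step()
```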
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that the fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- Masked Autoencoders Enable Efficient Knowledge Distillers [31.606287119666572]
This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders.
We minimize the distance between the intermediate feature map of the teacher model and that of the student model.
Our method can robustly distill knowledge from teacher models even with extremely high masking ratios.
arXiv Detail & Related papers (2022-08-25T17:58:59Z)
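The entry above distills Masked Autoencoder teachers by minimizing the distance between intermediate feature maps under aggressive masking. Below is a rough sketch of that idea; the 75% mask ratio, the L1 distance, and the assumption that teacher and student features share a width (otherwise a projection head would be needed) are illustrative choices.
```python
# Rough sketch: mask most patch tokens, run teacher and student on the
# visible tokens, and match their intermediate feature maps.
import torch
import torch.nn.functional as F

def masked_feature_distill_loss(teacher, student, patches, mask_ratio=0.75):
    # patches: (batch, num_tokens, dim) patch embeddings.
    b, n, d = patches.shape
    keep = int(n * (1.0 - mask_ratio))
    idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :keep]
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))

    with torch.no_grad():
        feat_t = teacher(visible)   # teacher's intermediate feature map
    feat_s = student(visible)       # student's feature map (same width assumed)
    return F.l1_loss(feat_s, feat_t)
```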
- DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning [94.89221799550593]
Self-supervised representation learning (SSL) has received widespread attention from the community.
Recent research argues that its performance suffers a sharp drop when model size decreases.
We propose a simple yet effective Distilled Contrastive Learning (DisCo) method to ease this issue by a large margin.
arXiv Detail & Related papers (2021-04-19T08:22:52Z)
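The DisCo entry above couples contrastive self-supervision on a lightweight model with distillation from a larger teacher. The sketch below is a generic rendering of that combination, pairing an InfoNCE term on the student with an embedding-consistency term toward a frozen teacher; the exact DisCo losses and weighting may differ, and lambda_kd is an assumed hyperparameter.
```python
# Generic distilled-contrastive-learning sketch: InfoNCE on the student's two
# augmented views plus a consistency term toward frozen teacher embeddings.
import torch
import torch.nn.functional as F

def distilled_contrastive_loss(z1_s, z2_s, z1_t, z2_t, tau=0.2, lambda_kd=1.0):
    z1_s, z2_s = F.normalize(z1_s, dim=-1), F.normalize(z2_s, dim=-1)
    z1_t, z2_t = F.normalize(z1_t, dim=-1), F.normalize(z2_t, dim=-1)

    # InfoNCE between the student's two augmented views of the same batch.
    logits = z1_s @ z2_s.t() / tau
    labels = torch.arange(z1_s.size(0), device=z1_s.device)
    contrastive = F.cross_entropy(logits, labels)

    # Pull student embeddings toward the (detached) teacher embeddings.
    consistency = F.mse_loss(z1_s, z1_t.detach()) + F.mse_loss(z2_s, z2_t.detach())
    return contrastive + lambda_kd * consistency
```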