Related papers: ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

URL: http://arxiv.org/abs/2411.06786v1
Date: Mon, 11 Nov 2024 08:25:21 GMT
Title: ScaleKD: Strong Vision Transformers Could Be Excellent Teachers
Authors: Jiawei Fan, Chao Li, Xiaolong Liu, Anbang Yao,
Abstract summary: We present a simple and effective knowledge distillation method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets. When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
Score: 15.446480934024652
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we question if well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross architecture knowledge distillation (KD) research, in the context of using large-scale datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components namely cross attention projector, dual-view feature mimicking and teacher parameter perception tailored to address the above problems, we present a simple and effective KD method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method gets 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains to the individually trained counterparts. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. The student backbones trained by our method transfer well on downstream MS-COCO and ADE20K datasets. More importantly, our method could be used as a more efficient alternative to the time-intensive pre-training paradigm for any target student model if a strong pre-trained ViT is available, reducing the amount of viewed training samples up to 195x.

Related papers

An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question if the textitextremely simple lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design ($5.7M$/$6.5M$) can achieve $79.4%$/$78.9%$ top-1 accuracy on ImageNet-1
arXiv Detail & Related papers (2024-04-18T14:14:44Z)
Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks. We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z)
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance [97.01406871579525]
We propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. We show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. Our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet.
arXiv Detail & Related papers (2023-09-21T17:59:53Z)
Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition. Specifically, we utilize the web-collected Coyo-700M dataset. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models. Our method allows for effortless integration with existing models' training pipelines. On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation [35.233203757760066]
We propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model. Our method achieves significant performance improvements in various computer vision tasks.
arXiv Detail & Related papers (2023-03-23T12:13:29Z)
Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching [49.730741713652435]
In this paper, we propose a method that can effectively transfer the representations of a large pre-trained multimodal model into a small target model. For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model. To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts.
arXiv Detail & Related papers (2023-01-07T17:24:11Z)
TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs) However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
Masked Autoencoders Enable Efficient Knowledge Distillers [31.606287119666572]
This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. We minimize the distance between the intermediate feature map of the teacher model and that of the student model. Our method can robustly distill knowledge from teacher models even with extremely high masking ratios.
arXiv Detail & Related papers (2022-08-25T17:58:59Z)
Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity. We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models. Our solution performs distillation by only driving prediction of the student model consistent with that of the teacher model. We empirically find that such simple distillation settings perform extremely effective, for example, the top-1 accuracy on ImageNet-1k validation set of MobileNetV3-large and ResNet50-D can be significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
SEED: Self-supervised Distillation For Visual Representation [34.63488756535054]
We propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion. We show that SEED dramatically boosts the performance of small networks on downstream tasks.
arXiv Detail & Related papers (2021-01-12T20:04:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.