Generic-to-Specific Distillation of Masked Autoencoders
- URL: http://arxiv.org/abs/2302.14771v1
- Date: Tue, 28 Feb 2023 17:13:14 GMT
- Title: Generic-to-Specific Distillation of Masked Autoencoders
- Authors: Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao, Qixiang Ye
- Abstract summary: We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher on image classification, object detection, and semantic segmentation, respectively.
- Score: 119.21281960831651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms have achieved unprecedented progress. Lightweight ViT models, however, are limited by their capacity and benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm for transferring representations from large (teacher) models to small (student) ones. However, conventional single-stage distillation easily gets stuck on task-specific transfer and fails to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with the hidden representations of the large model, so that task-agnostic knowledge is transferred. In specific distillation, the predictions of the small model are constrained to be consistent with those of the large model, transferring the task-specific features that guarantee task performance. With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher (ViT-Base) on image classification, object detection, and semantic segmentation, respectively, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.
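To make the two stages concrete, the following PyTorch-style sketch illustrates one plausible form of the generic and specific distillation objectives described above. The tensor shapes, the per-token normalization, the smooth-L1 and KL losses, and the loss weighting are illustrative assumptions rather than the authors' exact recipe, which is available in the linked repository.

```python
# A minimal sketch of the two G2SD stages. Shapes, loss choices, and weights
# are assumptions for illustration, not the authors' exact implementation.
import torch
import torch.nn.functional as F


def generic_distillation_loss(student_pred, teacher_hidden, mask):
    """Stage 1 (task-agnostic): align the student decoder's feature
    predictions with the teacher's hidden representations.

    student_pred, teacher_hidden: (B, N, D) token features.
    mask: (B, N) float, 1.0 where a token's prediction is supervised.
    """
    # Per-token normalization is a common feature-distillation choice (assumed).
    s = F.layer_norm(student_pred, student_pred.shape[-1:])
    t = F.layer_norm(teacher_hidden, teacher_hidden.shape[-1:])
    per_token = F.smooth_l1_loss(s, t, reduction="none").mean(dim=-1)  # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


def specific_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=1.0, alpha=0.5):
    """Stage 2 (task-specific): keep the student's predictions consistent
    with the teacher's, mixed with the ordinary supervised task loss."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    task = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * task
```

In this reading, the generic stage supervises feature predictions (for example, on masked tokens) in the spirit of masked-autoencoder pre-training, while the specific stage is a standard logit-consistency distillation applied during task fine-tuning.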
Related papers
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations show that the model adapts readily to various tasks while requiring a significantly smaller parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for KGC task.
Specifically, we perform pre-distillation on PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z) - TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs)
However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z) - Prototype-guided Cross-task Knowledge Distillation for Large-scale
Models [103.04711721343278]
Cross-task knowledge distillation helps to train a small student model to obtain a competitive performance.
We propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios.
arXiv Detail & Related papers (2022-12-26T15:00:42Z) - DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and
Quantization [75.72231742114951]
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks.
These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency.
We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
arXiv Detail & Related papers (2022-03-21T18:04:25Z) - Self-Feature Regularization: Self-Feature Distillation Without Teacher
Models [0.0]
Self-Feature Regularization(SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We firstly use generalization-l2 loss to match local features and a many-to-one approach to distill more intensively in the channel dimension.
arXiv Detail & Related papers (2021-03-12T15:29:00Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes (a minimal sketch of this last-layer attention transfer appears after this list).
arXiv Detail & Related papers (2020-02-25T15:21:10Z)