Knowledge Distillation in Vision Transformers: A Critical Review
- URL: http://arxiv.org/abs/2302.02108v2
- Date: Sat, 10 Feb 2024 11:02:38 GMT
- Title: Knowledge Distillation in Vision Transformers: A Critical Review
- Authors: Gousia Habib, Tausifa Jan Saleem, Brejesh Lall
- Abstract summary: Vision Transformers (ViTs) have demonstrated impressive performance improvements over Convolutional Neural Networks (CNNs).
Model compression has recently attracted considerable research attention as a potential remedy.
This paper discusses various approaches based upon KD for effective compression of ViT models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Natural Language Processing (NLP), Transformers have already
revolutionized the field by utilizing an attention-based encoder-decoder model.
Recently, some pioneering works have employed Transformer-like architectures in
Computer Vision (CV) and they have reported outstanding performance of these
architectures in tasks such as image classification, object detection, and
semantic segmentation. Vision Transformers (ViTs) have demonstrated impressive
performance improvements over Convolutional Neural Networks (CNNs) due to their
competitive modelling capabilities. However, these architectures demand massive
computational resources, which makes them difficult to deploy in
resource-constrained applications. Many solutions have been developed to
address this issue, such as compressive transformers and compression
operations like dilated convolution, min-max pooling, and 1D convolution. Model
compression has recently attracted considerable research attention as a
potential remedy. A number of model compression methods have been proposed in
the literature such as weight quantization, weight multiplexing, pruning and
Knowledge Distillation (KD). However, techniques like weight quantization,
pruning and weight multiplexing typically involve complex pipelines for
performing the compression. KD has been found to be a simple yet highly
effective model compression technique that allows a relatively simple model to
perform tasks almost as accurately as a complex model. This paper discusses
various approaches based on KD for effective compression of ViT models. The paper
elucidates the role played by KD in reducing the computational and memory
requirements of these models. The paper also presents the various challenges
faced by ViTs that are yet to be resolved.
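To make the core idea concrete, the following is a minimal PyTorch sketch of the classical soft-target distillation loss that most KD approaches for ViTs build upon; the function name, temperature, and weighting factor are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Soft-target knowledge distillation loss (Hinton-style sketch).

    Combines a KL-divergence term between the temperature-softened teacher
    and student distributions with the usual cross-entropy against the
    ground-truth labels. `temperature` and `alpha` are illustrative
    hyperparameters, not settings from the paper.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL term is scaled by T^2 so gradient magnitudes stay comparable
    # across different temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard-label supervision on the student's raw logits.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In a typical setup the teacher (e.g. a large ViT) is frozen and evaluated under `torch.no_grad()`, and only the smaller student receives gradients from this combined loss.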
Related papers
- Computer Vision Model Compression Techniques for Embedded Systems: A Survey [75.38606213726906]
This paper covers the main model compression techniques applied for computer vision tasks.
We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique.
We also share codes to assist researchers and new practitioners in overcoming initial implementation challenges.
arXiv Detail & Related papers (2024-08-15T16:41:55Z) - A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference (a minimal TopK sparsification sketch is provided after this list).
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - Blockwise Compression of Transformer-based Models without Retraining [6.118476907408718]
We propose BCT, a framework of blockwise compression for transformers without retraining.
Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise.
BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results.
arXiv Detail & Related papers (2023-04-04T02:55:40Z) - I3D: Transformer architectures with input-dependent dynamic depth for speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
arXiv Detail & Related papers (2023-03-14T04:47:00Z) - Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods.
We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator.
We show that this framework maintains compelling performance at high degrees of sparsity, can be easily applied to new tasks and models, and enables meaningful comparisons between different pruning granularities.
arXiv Detail & Related papers (2020-07-03T04:18:54Z) - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT [41.04066537294312]
Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks.
These models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications.
One potential remedy for this is model compression, which has attracted a lot of research attention.
arXiv Detail & Related papers (2020-02-27T09:20:31Z) - Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)
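As a companion to the 'Activations and Gradients Compression for Model-Parallel Training' entry above, here is a minimal sketch of the generic magnitude-based TopK sparsifier that such methods apply to activation and gradient tensors; the function name and retention ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def topk_compress(tensor, ratio=0.01):
    """Keep only the `ratio` fraction of entries with the largest magnitude,
    zeroing out the rest. `ratio` is an illustrative value."""
    flat = tensor.flatten()
    k = max(1, int(flat.numel() * ratio))
    # Indices of the k largest-magnitude entries.
    _, idx = torch.topk(flat.abs(), k)
    compressed = torch.zeros_like(flat)
    compressed[idx] = flat[idx]
    return compressed.view_as(tensor)
```

In a communication-efficient training setup, only the retained values and their indices would actually be transmitted between workers, which is where the bandwidth saving comes from.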