Knowledge Distillation in Vision Transformers: A Critical Review
- URL: http://arxiv.org/abs/2302.02108v2
- Date: Sat, 10 Feb 2024 11:02:38 GMT
- Title: Knowledge Distillation in Vision Transformers: A Critical Review
- Authors: Gousia Habib, Tausifa Jan Saleem, Brejesh Lall
- Abstract summary: Vision Transformers (ViTs) have demonstrated impressive performance improvements over Convolutional Neural Networks (CNNs).
Model compression has recently attracted considerable research attention as a potential remedy.
This paper discusses various approaches based upon KD for effective compression of ViT models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Natural Language Processing (NLP), Transformers have already
revolutionized the field by utilizing an attention-based encoder-decoder model.
Recently, some pioneering works have employed Transformer-like architectures in
Computer Vision (CV) and they have reported outstanding performance of these
architectures in tasks such as image classification, object detection, and
semantic segmentation. Vision Transformers (ViTs) have demonstrated impressive
performance improvements over Convolutional Neural Networks (CNNs) due to their
competitive modelling capabilities. However, these architectures demand massive
computational resources, which makes them difficult to deploy in
resource-constrained applications. Many solutions have been developed to
address this issue, such as compressive transformers and compression
operations like dilated convolution, min-max pooling, and 1D convolution. Model
compression has recently attracted considerable research attention as a
potential remedy. A number of model compression methods have been proposed in
the literature such as weight quantization, weight multiplexing, pruning and
Knowledge Distillation (KD). However, techniques like weight quantization,
pruning and weight multiplexing typically involve complex pipelines for
performing the compression. KD has been found to be a simple yet highly
effective model compression technique that allows a relatively simple model to
perform tasks almost as accurately as a complex model. This paper discusses
various approaches based on KD for effective compression of ViT models. The paper
elucidates the role played by KD in reducing the computational and memory
requirements of these models. The paper also presents the various challenges
faced by ViTs that are yet to be resolved.
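To make the core idea concrete, the following is a minimal PyTorch sketch of the classical soft-target distillation loss that most KD approaches for ViTs build upon; the function name, temperature, and weighting factor are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Soft-target knowledge distillation loss (Hinton-style sketch).

    Combines a KL-divergence term between the temperature-softened teacher
    and student distributions with the usual cross-entropy against the
    ground-truth labels. `temperature` and `alpha` are illustrative
    hyperparameters, not settings from the paper.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL term is scaled by T^2 so gradient magnitudes stay comparable
    # across different temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard-label supervision on the student's raw logits.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In a typical setup the teacher (e.g. a large ViT) is frozen and evaluated under `torch.no_grad()`, and only the smaller student receives gradients from this combined loss.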
Related papers
- Computer Vision Model Compression Techniques for Embedded Systems: A Survey [75.38606213726906]
This paper covers the main model compression techniques applied for computer vision tasks.
We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique.
We also share codes to assist researchers and new practitioners in overcoming initial implementation challenges.
arXiv Detail & Related papers (2024-08-15T16:41:55Z) - A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference (a minimal TopK sparsification sketch is provided after this list).
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - Blockwise Compression of Transformer-based Models without Retraining [6.118476907408718]
We propose BCT, a framework of blockwise compression for transformers without retraining.
Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise.
BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results.
arXiv Detail & Related papers (2023-04-04T02:55:40Z) - I3D: Transformer architectures with input-dependent dynamic depth for speech recognition [41.35563331283372]
We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
arXiv Detail & Related papers (2023-03-14T04:47:00Z) - Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods.
We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator.
We show that this framework maintains compelling performance at high degrees of sparsity, can be easily applied to new tasks and models, and enables meaningful comparisons between different pruning granularities.
arXiv Detail & Related papers (2020-07-03T04:18:54Z) - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT [41.04066537294312]
Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks.
These models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications.
One potential remedy for this is model compression, which has attracted a lot of research attention.
arXiv Detail & Related papers (2020-02-27T09:20:31Z) - Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)
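As a companion to the 'Activations and Gradients Compression for Model-Parallel Training' entry above, here is a minimal sketch of the generic magnitude-based TopK sparsifier that such methods apply to activation and gradient tensors; the function name and retention ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def topk_compress(tensor, ratio=0.01):
    """Keep only the `ratio` fraction of entries with the largest magnitude,
    zeroing out the rest. `ratio` is an illustrative value."""
    flat = tensor.flatten()
    k = max(1, int(flat.numel() * ratio))
    # Indices of the k largest-magnitude entries.
    _, idx = torch.topk(flat.abs(), k)
    compressed = torch.zeros_like(flat)
    compressed[idx] = flat[idx]
    return compressed.view_as(tensor)
```

In a communication-efficient training setup, only the retained values and their indices would actually be transmitted between workers, which is where the bandwidth saving comes from.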