AttentionLite: Towards Efficient Self-Attention Models for Vision
- URL: http://arxiv.org/abs/2101.05216v1
- Date: Mon, 21 Dec 2020 17:54:09 GMT
- Title: AttentionLite: Towards Efficient Self-Attention Models for Vision
- Authors: Souvik Kundu, Sairam Sundaresan
- Abstract summary: We propose a novel framework for producing a class of parameter- and compute-efficient models called AttentionLite, suitable for resource-constrained applications.
We can simultaneously distill knowledge from a compute-heavy teacher while also pruning the student model in a single pass of training.
- Score: 9.957033392865982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel framework for producing a class of parameter- and
compute-efficient models called AttentionLite, suitable for resource-constrained
applications. Prior work has primarily focused on optimizing models either via
knowledge distillation or pruning. In addition to fusing these two mechanisms,
our joint optimization framework also leverages recent advances in
self-attention as a substitute for convolutions. We can simultaneously distill
knowledge from a compute-heavy teacher while also pruning the student model in
a single pass of training, thereby reducing training and fine-tuning times
considerably. We evaluate the merits of our proposed approach on the CIFAR-10,
CIFAR-100, and Tiny-ImageNet datasets. Not only do our AttentionLite models
significantly outperform their unoptimized counterparts in accuracy, but in some
cases they also perform almost as well as their compute-heavy teachers while
consuming only a fraction of the parameters and FLOPs.
Concretely, AttentionLite models can achieve up to 30x parameter efficiency and
2x computation efficiency with no significant accuracy drop compared to their
teacher.
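The abstract describes the single-pass joint optimization only at a high level. The sketch below is a minimal, hedged illustration of what combining soft-label knowledge distillation with in-training magnitude pruning of the student can look like; the function names, the hard-threshold pruning rule, and the loss weights are assumptions for illustration, not the authors' implementation, and the self-attention student architecture the paper actually uses is omitted.

```python
# Minimal sketch (not the authors' code): jointly distilling from a frozen
# teacher and magnitude-pruning the student within a single training pass.
# `student`, `teacher`, and the sparsity/temperature values are placeholders.
import torch
import torch.nn.functional as F

def prune_by_magnitude(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights so roughly `sparsity` of them are removed."""
    for p in model.parameters():
        if p.dim() > 1:  # prune weight tensors, leave biases/norm params dense
            k = int(sparsity * p.numel())
            if k > 0:
                threshold = p.detach().abs().flatten().kthvalue(k).values
                p.data.mul_((p.detach().abs() > threshold).float())

def train_step(student, teacher, images, labels, optimizer,
               temperature=4.0, alpha=0.9, sparsity=0.9):
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)

    # Soft-label distillation term (KL between softened distributions) ...
    kd = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=1),
        F.softmax(t_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    # ... blended with the usual cross-entropy on ground-truth labels.
    ce = F.cross_entropy(s_logits, labels)
    loss = alpha * kd + (1 - alpha) * ce

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Re-impose the sparsity constraint right after the update, so distillation
    # and pruning proceed together in one pass rather than in separate stages.
    prune_by_magnitude(student, sparsity)
    return loss.item()
```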
Related papers
- Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation [10.48108719012248]
We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model.
In contrast to much of the previous work, we scale up the parameters of the student model during training.
arXiv Detail & Related papers (2024-11-10T12:40:59Z)
- Efficient Deep Learning Board: Training Feedback Is Not All You Need [28.910266386748525]
We propose EfficientDL, an innovative deep learning board for automatic performance prediction and component recommendation.
Its ability to operate without training feedback comes from our proposed comprehensive, multi-dimensional, fine-grained system component dataset.
For example, EfficientDL operates seamlessly with mainstream models such as ResNet50, MobileNetV3, EfficientNet-B0, MaxViT-T, Swin-B, and DaViT-T.
arXiv Detail & Related papers (2024-10-17T14:43:34Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
- Meta-Ensemble Parameter Learning [35.6391802164328]
In this paper, we study whether a meta-learning strategy can be used to directly predict the parameters of a single model whose performance is comparable to that of an ensemble.
We introduce WeightFormer, a Transformer-based model that can predict student network weights layer by layer in a forward pass.
arXiv Detail & Related papers (2022-10-05T00:47:24Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and by 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models (see the sketch after this list).
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000 to 88.5% and 46.6%, respectively, using self-distillation.
arXiv Detail & Related papers (2020-07-13T16:56:27Z)
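For the Powerpropagation entry above, here is a minimal sketch of the reparameterisation it describes: weights are expressed as w = theta * |theta|^(alpha - 1) with alpha > 1, so each parameter's gradient is scaled by its own magnitude, small weights drift toward exactly zero, and more of them can be pruned safely. The module name, the linear-layer setting, and the choice of alpha are assumptions for illustration, not the paper's code.

```python
# Illustrative Powerpropagation-style reparameterisation (w = theta * |theta|^(alpha-1));
# a sketch under assumed settings, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.alpha = alpha
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.theta, a=5 ** 0.5)  # standard linear-layer init

    def effective_weight(self):
        # Gradients w.r.t. theta are scaled by |theta|^(alpha-1), so large weights
        # grow faster while small ones stay near zero and become easy to prune.
        return self.theta * self.theta.abs().pow(self.alpha - 1)

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.bias)

# Usage: drop in place of nn.Linear; after training, apply an ordinary
# magnitude-pruning pass to effective_weight() to remove near-zero parameters.
layer = PowerpropLinear(128, 64)
out = layer(torch.randn(8, 128))
```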
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.