Sparse then Prune: Toward Efficient Vision Transformers
- URL: http://arxiv.org/abs/2307.11988v1
- Date: Sat, 22 Jul 2023 05:43:33 GMT
- Title: Sparse then Prune: Toward Efficient Vision Transformers
- Authors: Yogi Prasetyo, Novanto Yudistira, Agus Wahyu Widodo
- Abstract summary: The Vision Transformer is a deep learning model inspired by the success of the Transformer model in Natural Language Processing.
Applying Sparse Regularization to Vision Transformers can increase accuracy by 0.12% on CIFAR-100 and ImageNet-100.
Applying pruning to models trained with Sparse Regularization yields even better results.
- Score: 2.191505742658975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Vision Transformer architecture is a deep learning model inspired by the
success of the Transformer model in Natural Language Processing. However, the
self-attention mechanism, large number of parameters, and the requirement for a
substantial amount of training data still make Vision Transformers
computationally burdensome. In this research, we investigate the possibility of
applying Sparse Regularization to Vision Transformers and the impact of
Pruning, either after Sparse Regularization or without it, on the trade-off
between performance and efficiency. To accomplish this, we apply Sparse
Regularization and Pruning methods to the Vision Transformer architecture for
image classification tasks on the CIFAR-10, CIFAR-100, and ImageNet-100
datasets. The training process for the Vision Transformer model consists of two
parts: pre-training and fine-tuning. Pre-training utilizes ImageNet21K data,
followed by fine-tuning for 20 epochs. The results show that on the CIFAR-100
and ImageNet-100 test sets, models with Sparse Regularization increase accuracy
by 0.12%. Furthermore, applying pruning to models with Sparse Regularization
yields even better results: it increases the average accuracy by 0.568% on
CIFAR-10, 1.764% on CIFAR-100, and 0.256% on ImageNet-100 compared to pruning
models without Sparse Regularization.
Code can be accessed here: https://github.com/yogiprsty/Sparse-ViT
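A minimal sketch of the two-stage recipe described above, under stated assumptions: the abstract does not specify the exact form of the Sparse Regularization or the pruning ratio, so the L1 weight penalty, the 30% unstructured magnitude pruning, and the torchvision ViT-B/16 stand-in below are illustrative choices, not the paper's confirmed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import vit_b_16

# Stand-in for an ImageNet-21k pre-trained ViT; the paper fine-tunes a
# pre-trained model for 20 epochs on CIFAR-10, CIFAR-100, or ImageNet-100.
model = vit_b_16(weights=None)
model.heads = nn.Linear(768, 100)  # e.g. a 100-class head for CIFAR-100

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
l1_lambda = 1e-5  # assumed sparsity strength (hypothetical value)

def train_step(images, labels):
    """One fine-tuning step with an L1 sparsity penalty on the weights."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    # Sparse Regularization (assumed L1 form): pushes weights toward zero
    # so later magnitude pruning removes mostly unimportant connections.
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    (loss + l1_lambda * l1_penalty).backward()
    optimizer.step()

# After fine-tuning: unstructured L1-magnitude pruning of each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # assumed ratio
        prune.remove(module, "weight")  # make the zeroed weights permanent
```

The ordering mirrors the reported finding: training with a sparsity penalty first concentrates importance in fewer weights, so the subsequent magnitude pruning costs less accuracy than pruning an unregularized model.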
Related papers
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and codes to promote the exploration of AR/token models for visual generation and unified learning.
arXiv Detail & Related papers (2024-04-03T17:59:53Z)
- Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images [0.0]
A pure Vision Transformer (ViT) can achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling.
Experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G.
arXiv Detail & Related papers (2024-02-06T06:41:24Z)
- Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement [68.44100784364987]
We propose a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users.
We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+.
Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks.
arXiv Detail & Related papers (2023-03-15T23:10:17Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z) - Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z) - An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
arXiv Detail & Related papers (2020-10-22T17:55:59Z) - On the Generalization Effects of Linear Transformations in Data
Augmentation [32.01435459892255]
Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks.
We consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting.
We propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data.
arXiv Detail & Related papers (2020-05-02T04:10:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.