Boost Vision Transformer with GPU-Friendly Sparsity and Quantization
- URL: http://arxiv.org/abs/2305.10727v1
- Date: Thu, 18 May 2023 05:55:48 GMT
- Title: Boost Vision Transformer with GPU-Friendly Sparsity and Quantization
- Authors: Chong Yu, Tao Chen, Zhongxue Gan, Jiayuan Fan
- Abstract summary: This paper presents a carefully designed compression scheme that maximally exploits GPU-friendly 2:4 fine-grained structured sparsity and quantization.
Experimental results show the GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer models by 6.4-12.7x in model size and 30.3-62x in FLOPs.
- Score: 29.96026533220083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer has extended its success from the language domain to the vision domain.
Because of the stacked self-attention and cross-attention blocks, accelerating the deployment of vision transformers on GPU hardware is challenging and has rarely been studied.
This paper presents a carefully designed compression scheme that maximally exploits GPU-friendly 2:4 fine-grained structured sparsity and quantization.
Specifically, an original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type.
The floating-point sparse model is then quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the additional GPU speedup available for 2:4 sparse computation with integer tensors.
Mixed-strategy knowledge distillation is applied during both the pruning and quantization stages.
The proposed compression scheme is flexible enough to support both supervised and unsupervised learning.
Experimental results show that the GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer models by 6.4-12.7x in model size and 30.3-62x in FLOPs with negligible accuracy degradation on the ImageNet classification, COCO detection, and ADE20K segmentation benchmarks.
Moreover, GPUSQ-ViT boosts actual deployment performance by 1.39-1.79x in latency and 3.22-3.43x in throughput on the A100 GPU, and by 1.57-1.69x in latency and 2.11-2.51x in throughput on AGX Orin.
Related papers
- Accelerating Transformer Pre-training with 2:4 Sparsity [19.64391647966267]
NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent.
We propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator, determining a feasible decay factor in the warm-up stage, and enhancing the model's quality.
Our algorithm converges similarly to dense training on several transformer pre-training tasks, while clear acceleration is observed across different transformer block shapes.
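A common way to train through such a 2:4 mask is a straight-through estimator (STE), where the forward pass uses the pruned weights but the gradient flows to the dense parameters unchanged. The sketch below is a generic STE formulation for illustration, not necessarily the sparse-refined variant this paper modifies.

```python
import torch

class Sparse24STE(torch.autograd.Function):
    """Forward: apply a 2:4 mask (keep the 2 largest-magnitude weights per group of 4).
    Backward: pass the gradient straight through to the dense weights."""

    @staticmethod
    def forward(ctx, weight):
        o, i = weight.shape
        groups = weight.reshape(o, i // 4, 4)
        _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
        mask = torch.ones_like(groups).scatter_(-1, drop_idx, 0.0)
        return (groups * mask).reshape(o, i)

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through: pruned entries still receive gradient updates,
        # so the 2:4 pattern can change as training progresses
        return grad_output

# Usage: keep dense parameters, but run the forward pass on the sparse view.
w = torch.randn(8, 16, requires_grad=True)
x = torch.randn(4, 16)
y = x @ Sparse24STE.apply(w).t()
y.sum().backward()   # w.grad is dense thanks to the straight-through gradient
```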
arXiv Detail & Related papers (2024-04-02T11:12:42Z)
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38 times higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
For single-batch generative inference with LLMs, the main bottleneck is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression at ultra-low precisions down to 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
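A minimal sketch of the Dense-and-Sparse idea is shown below: a small fraction of large-magnitude outliers is pulled into a sparse matrix kept at higher precision, and the remainder is quantized densely. The outlier threshold and the simple per-channel uniform quantizer here are illustrative assumptions; SqueezeLLM itself uses sensitivity-based non-uniform quantization for the dense part.

```python
import torch

def dense_and_sparse_decompose(W: torch.Tensor, outlier_quantile: float = 0.999, bits: int = 3):
    """Split W into a sparse outlier matrix and a densely quantized remainder."""
    thresh = W.abs().flatten().quantile(outlier_quantile)
    outlier_mask = W.abs() > thresh
    sparse_outliers = (W * outlier_mask).to_sparse()       # few large values, stored sparsely
    dense_part = W * ~outlier_mask
    # per-output-channel symmetric uniform quantization of the dense remainder
    qmax = 2 ** (bits - 1) - 1
    scale = dense_part.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(dense_part / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale, sparse_outliers

def reconstruct(q, scale, sparse_outliers):
    # dequantize the dense part and add the sparse outliers back
    return q.to(scale.dtype) * scale + sparse_outliers.to_dense()

W = torch.randn(64, 128)
q, scale, outliers = dense_and_sparse_decompose(W)
W_hat = reconstruct(q, scale, outliers)
```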
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- MixFormerV2: Efficient Fully Transformer Tracking [49.07428299165031]
Transformer-based trackers have achieved strong accuracy on standard benchmarks, but their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms.
We propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operation or complex score prediction module.
arXiv Detail & Related papers (2023-05-25T09:50:54Z)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [34.91478831993398]
GPTQ is a new one-shot weight quantization method based on approximate second-order information.
It can quantize GPT models with 175 billion parameters in approximately four GPU hours.
Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods.
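For context, the sketch below shows only the plain round-to-nearest per-channel quantizer that one-shot methods like GPTQ start from; GPTQ improves on this by quantizing weights sequentially and using approximate second-order (Hessian) information to compensate the rounding error. The names below are illustrative.

```python
import torch

def quantize_rtn(W: torch.Tensor, bits: int = 4):
    """Round-to-nearest symmetric quantization with one scale per output channel.
    This is the naive baseline; GPTQ's second-order error compensation is not shown."""
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return q, scale

W = torch.randn(4096, 4096)
q, scale = quantize_rtn(W, bits=4)
W_hat = q * scale                        # dequantized weights used at inference time
print((W - W_hat).abs().mean())          # average rounding error of the naive baseline
```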
arXiv Detail & Related papers (2022-10-31T13:42:40Z)
- Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization [78.18328503396057]
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks.
This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization.
arXiv Detail & Related papers (2022-08-10T05:54:46Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework aimed at jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
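One generic way to write such a budget-constrained objective, given as an illustrative formulation rather than the paper's exact one, is to jointly optimize the weights $w$, layer-wise pruning masks $m$, and skip configuration $s$ under a resource budget $B$:

```latex
\min_{w,\,m,\,s}\;
  \mathcal{L}_{\mathrm{task}}\!\big(f(x;\, w \odot m,\, s)\big)
  \;+\; \lambda\, \mathcal{L}_{\mathrm{KD}}\!\big(f(x;\, w \odot m,\, s),\, f_{\mathrm{teacher}}(x)\big)
\quad \text{s.t.}\quad \mathrm{FLOPs}(m, s) \le B
```

Here $\mathcal{L}_{\mathrm{KD}}$ is the knowledge-distillation term against the uncompressed teacher, and the budget constraint can be handled with a Lagrangian penalty in practice.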
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
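As a simple illustration of mini-batch training with neighborhood sampling, the generic sketch below draws at most a fixed number of neighbors per node at each hop; it is not the performance-engineered sampler from this paper, and the graph representation and names are assumptions for the example.

```python
import random

def sample_neighborhood(adj, batch_nodes, fanouts=(10, 5)):
    """Sample a multi-hop neighborhood for a mini-batch of target nodes.
    adj: dict mapping node -> list of neighbor nodes.
    fanouts: max number of neighbors drawn per node at each hop."""
    required_nodes = set(batch_nodes)
    frontier = set(batch_nodes)
    hops = []                                   # one {node: sampled neighbors} dict per hop
    for fanout in fanouts:
        sampled = {}
        next_frontier = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            sampled[node] = random.sample(neighbors, min(fanout, len(neighbors)))
            next_frontier.update(sampled[node])
        hops.append(sampled)
        required_nodes |= next_frontier
        frontier = next_frontier
    # only required_nodes' features need to be moved to the GPU for this mini-batch
    return required_nodes, hops

# Toy graph: a 4-node chain plus a hub node connected to everyone.
adj = {0: [1, 4], 1: [0, 2, 4], 2: [1, 3, 4], 3: [2, 4], 4: [0, 1, 2, 3]}
nodes, hops = sample_neighborhood(adj, batch_nodes=[0, 2], fanouts=(2, 2))
```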
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4bit-Compact Multilayer Perceptrons [19.411734658680967]
We propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine of deep neural networks (DNNs).
Our approach is centred around compression as a means for reducing the area as well as power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performances.
We show that we can achieve throughputs of 2.45 TOPS with a total power consumption of 3.6W on a Virtual Ultrascale FPGA XCVU440 device implementation, and achieve a total power efficiency of 20.17 TOPS/W on a 22nm process ASIC version.
arXiv Detail & Related papers (2020-12-17T19:10:04Z)