ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
- URL: http://arxiv.org/abs/2306.06446v6
- Date: Thu, 25 Jul 2024 17:19:31 GMT
- Title: ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
- Authors: Haoran You, Huihong Shi, Yipin Guo, Yingyan Celine Lin
- Abstract summary: We propose a new type of multiplication-reduced model, dubbed $\textbf{ShiftAddViT}$, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
- Score: 6.473688838974095
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed $\textbf{ShiftAddViT}$, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all $\texttt{MatMuls}$ among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization on attention maintains model accuracy, while inevitably leading to accuracy drops when applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router that assigns a dynamic number of input tokens to different experts according to their latency. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to $\textbf{5.18}\times$ latency reductions on GPUs and $\textbf{42.9}$% energy savings, while maintaining accuracy comparable to the original or other efficient ViTs.
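To make the two key ingredients above concrete, here is a minimal NumPy sketch: (1) a linear layer whose weights are reparameterized as signed powers of two, so each multiplication can in principle be realized as a sign flip plus a bit shift, and (2) a latency-weighted load-balancing term of the kind a router could be trained with. The rounding rule, function names, and the exact loss form are illustrative assumptions, not the paper's TVM kernels or its precise formulation.

```python
import numpy as np

def quantize_to_shift(w, max_shift=8):
    """Illustrative power-of-two rounding: map each weight to sign * 2**p."""
    sign = np.sign(w)
    # round |w| to the nearest power of two in log2 space; clamp the exponent range
    p = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), -max_shift, 0).astype(np.int32)
    return sign, p

def shift_linear(x, sign, p):
    """Multiplication-reduced linear layer: y = x @ (sign * 2**p).

    On integer hardware, multiplying by 2**p is a bit shift; here it is
    emulated in floating point with ldexp for readability.
    """
    w_hat = sign * np.ldexp(1.0, p)
    return x @ w_hat

def latency_aware_balance_loss(router_probs, expert_assign, latencies):
    """Hypothetical latency-weighted load-balancing term (Switch-style).

    router_probs:  (tokens, experts) router softmax outputs
    expert_assign: (tokens,) chosen expert per token
    latencies:     (experts,) measured per-expert latency, e.g. from profiling
    """
    n_tokens, n_experts = router_probs.shape
    frac = np.bincount(expert_assign, minlength=n_experts) / n_tokens  # realized load
    mean_prob = router_probs.mean(axis=0)                              # router preference
    weights = latencies / latencies.sum()                              # penalize slow experts more
    return n_experts * np.sum(weights * frac * mean_prob)

rng = np.random.default_rng(0)

# reparameterize a random "pretrained" weight matrix and check the approximation error
w = rng.normal(scale=0.1, size=(64, 64))
x = rng.normal(size=(4, 64))                       # 4 tokens, 64 channels
sign, p = quantize_to_shift(w)
print(np.abs(x @ w - shift_linear(x, sign, p)).mean())

# toy routing example with three experts (e.g. multiplication, shift, add)
probs = rng.dirichlet(np.ones(3), size=16)         # 16 tokens, 3 experts
lat = np.array([1.0, 0.4, 0.1])                    # made-up relative latencies
print(latency_aware_balance_loss(probs, probs.argmax(axis=1), lat))
```

In the actual system these kernels are compiled and tuned with TVM, and the attention path additionally maps queries and keys to binary codes in Hamming space before applying additive kernels, which is not shown here.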
Related papers
- ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration [8.829482765731022]
$N:M$ sparsity is an emerging model compression method supported by more and more accelerators.
We propose ELSA, Exploiting Layer-wise $N:M$ Sparsity for ViTs.
arXiv Detail & Related papers (2024-09-15T12:14:24Z)
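For readers unfamiliar with the $N:M$ pattern ELSA builds on: within every group of M consecutive weights, at most N are kept nonzero (e.g., 2:4 on recent sparse tensor cores). Below is a minimal magnitude-based sketch of that pattern; it is a generic illustration, not ELSA's layer-wise sparsity selection.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in each consecutive group of m."""
    flat = w.reshape(-1, m)                                   # group along the last axis
    idx = np.argsort(np.abs(flat), axis=1)                    # ascending by magnitude
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, idx[:, : m - n], False, axis=1)   # drop the m - n smallest
    return (flat * mask).reshape(w.shape)

w = np.random.default_rng(0).normal(size=(4, 8))
print(nm_prune(w))   # every group of 4 consecutive weights now has exactly 2 zeros
```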
- P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer [8.22044535304182]
Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive.
To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors.
We propose P$^2$-ViT, the first Power-of-Two (PoT) post-training quantization and acceleration framework.
arXiv Detail & Related papers (2024-05-30T10:26:36Z)
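The point of power-of-two scaling is that a floating-point scale forces an extra multiply when rescaling quantized values, while a scale of the form $2^k$ can be applied as a shift. A rough sketch of symmetric quantization with a PoT scale follows; the max-based calibration rule is an assumption for illustration, not the paper's procedure.

```python
import numpy as np

def quantize_pot_scale(x, bits=8):
    """Symmetric quantization whose scale is a power of two (2**k)."""
    qmax = 2 ** (bits - 1) - 1
    k = int(np.ceil(np.log2(np.abs(x).max() / qmax)))        # smallest 2**k covering the range
    q = np.clip(np.round(x / 2.0 ** k), -qmax - 1, qmax).astype(np.int8)
    return q, k                                              # dequantize as q * 2**k, i.e. a shift

x = np.random.default_rng(0).normal(size=(3, 3)).astype(np.float32)
q, k = quantize_pot_scale(x)
print(np.abs(x - q.astype(np.float32) * 2.0 ** k).max())     # quantization error
```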
- Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer [5.141764719319689]
Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks.
However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization.
We propose Trio-ViT, which eliminates the troublesome Softmax and instead integrates linear attention with low computational complexity.
arXiv Detail & Related papers (2024-05-06T21:57:35Z)
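Softmax-free (linear) attention is the building block Trio-ViT quantizes and accelerates: replacing $\mathrm{softmax}(QK^\top)$ with a kernel feature map $\phi$ lets the product be regrouped as $\phi(Q)\,(\phi(K)^\top V)$, so the $N \times N$ attention matrix is never materialized. The sketch below is a generic linear-attention form with $\phi(t)=\mathrm{elu}(t)+1$; it is not Trio-ViT's specific formulation.

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Softmax-free attention: phi(q) @ (phi(k)^T v), normalized per query."""
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))      # elu(t) + 1, keeps features positive
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                                            # (d, d_v) key/value summary
    z = qp @ kp.sum(axis=0, keepdims=True).T + eps           # per-query normalizer
    return (qp @ kv) / z

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(3, n, d))
print(linear_attention(q, k, v).shape)                       # (16, 8); no 16 x 16 matrix is built
```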
- PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676]
We propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT).
PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters.
It reduces over 37% FLOPs and improves the throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
arXiv Detail & Related papers (2023-10-03T05:55:11Z)
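PPT's appeal is that both pruning and pooling reuse signals the network already computes, so no new parameters are trained. A simplified sketch of one such heuristic is shown below: keep the patch tokens that the [CLS] token attends to most and mean-pool the rest into a single token. The keep ratio and scoring rule are illustrative assumptions, not PPT's exact policy.

```python
import numpy as np

def prune_and_pool(tokens, cls_attn, keep_ratio=0.7):
    """Keep the most-attended patch tokens; pool the pruned ones into one token.

    tokens:   (n, d) patch tokens (excluding [CLS])
    cls_attn: (n,)   attention from [CLS] to each patch token
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    order = np.argsort(-cls_attn)                            # most-attended first
    keep, drop = order[:k], order[k:]
    pooled = tokens[drop].mean(axis=0, keepdims=True) if drop.size else np.empty((0, d))
    return np.concatenate([tokens[keep], pooled], axis=0)

rng = np.random.default_rng(0)
out = prune_and_pool(rng.normal(size=(196, 64)), rng.random(196))
print(out.shape)                                             # (138, 64): 137 kept + 1 pooled token
```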
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
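The conversion step above is conceptually just parameter averaging: every expert in an MoE layer is an FFN of the same shape, so after training they can be collapsed into a single FFN whose parameters are the element-wise mean of the experts'. A minimal sketch of that collapse, ignoring the routing and the scheme's training-time details:

```python
import numpy as np

def average_experts(experts):
    """Collapse same-shaped expert FFNs into one FFN by element-wise weight averaging."""
    return {name: np.mean([e[name] for e in experts], axis=0) for name in experts[0]}

rng = np.random.default_rng(0)
experts = [{"w1": rng.normal(size=(64, 256)), "b1": rng.normal(size=256),
            "w2": rng.normal(size=(256, 64)), "b2": rng.normal(size=64)}
           for _ in range(4)]
ffn = average_experts(experts)
print(ffn["w1"].shape, ffn["w2"].shape)                      # (64, 256) (256, 64): a single FFN again
```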
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction but leaves challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- MixFormerV2: Efficient Fully Transformer Tracking [49.07428299165031]
Transformer-based trackers have achieved strong accuracy on the standard benchmarks.
But their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms.
We propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operations or complex score prediction modules.
arXiv Detail & Related papers (2023-05-25T09:50:54Z)
- ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design [42.46121663652989]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
arXiv Detail & Related papers (2022-10-18T04:07:23Z)
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
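Parameterizing cross-token relations with relative positions amounts to a token-mixing matrix whose entry $(i, j)$ depends only on the offset $i - j$, i.e., a Toeplitz matrix built from a small table of learnable values. The 1-D sketch below illustrates this generic idea; PoSGU's actual 2-D formulation and gating are not reproduced here.

```python
import numpy as np

def relative_position_mixing(tokens, rel_table):
    """Token mixing whose weights depend only on the relative offset i - j.

    tokens:    (n, d) token sequence
    rel_table: (2n - 1,) values indexed by relative offset (would be learnable)
    """
    n = tokens.shape[0]
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]  # i - j in [-(n-1), n-1]
    mix = rel_table[offsets + n - 1]                          # (n, n) Toeplitz mixing matrix
    mix = mix / mix.sum(axis=1, keepdims=True)                # simple row normalization
    return mix @ tokens

rng = np.random.default_rng(0)
out = relative_position_mixing(rng.normal(size=(8, 16)), rng.random(15))
print(out.shape)                                              # (8, 16)
```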
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance on various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.