Related papers: RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

URL: http://arxiv.org/abs/2505.21847v2
Date: Mon, 02 Jun 2025 06:39:14 GMT
Title: RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers
Authors: Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang,
Abstract summary: We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency.<n>We propose a novel channel idle mechanism that facilitates post-training structural re parameterization for efficient FFN layers during testing.
Score: 14.876863939653548
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact signifying as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs to our best knowledge, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.

Related papers

Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
textbfSparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos. Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead. Our results show that our Sparse-Tuning reduces GFLOPs to textbf62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z)
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation. DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization [102.92240148504774]
We study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT)
arXiv Detail & Related papers (2023-11-10T18:59:54Z)
EIT: Efficiently Lead Inductive Biases to ViT [17.66805405320505]
Vision Transformer (ViT) depends on properties similar to the inductive bias inherent in Convolutional Neural Networks. We propose an architecture called Efficiently lead Inductive biases to ViT (EIT), which can effectively lead the inductive biases to both phases of ViT. In four popular small-scale datasets, compared with ViT, EIT has an accuracy improvement of 12.6% on average with fewer parameters and FLOPs.
arXiv Detail & Related papers (2022-03-14T14:01:17Z)
Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems. ViT saturates quickly with depth increasing, due to the observed attention collapse or patch uniformity. We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage. We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently. We show that, unlike convolution neural networks (CNNs)that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper. We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.