Attention Map Guided Transformer Pruning for Edge Device
- URL: http://arxiv.org/abs/2304.01452v1
- Date: Tue, 4 Apr 2023 01:51:53 GMT
- Title: Attention Map Guided Transformer Pruning for Edge Device
- Authors: Junzhu Mao, Yazhou Yao, Zeren Sun, Xingguo Huang, Fumin Shen and
Heng-Tao Shen
- Abstract summary: Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
- Score: 98.42178656762114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to its significant capability of modeling long-range dependencies, vision
transformer (ViT) has achieved promising success in both holistic and occluded
person re-identification (Re-ID) tasks. However, the inherent problems of
transformers such as the huge computational cost and memory footprint are still
two unsolved issues that will block the deployment of ViT based person Re-ID
models on resource-limited edge devices. Our goal is to reduce both the
inference complexity and model size without sacrificing the comparable accuracy
on person Re-ID, especially for tasks with occlusion. To this end, we propose a
novel attention map guided (AMG) transformer pruning method, which removes both
redundant tokens and heads with the guidance of the attention map in a
hardware-friendly way. We first calculate the entropy in the key dimension and
sum it up for the whole map, and the corresponding head parameters of maps with
high entropy will be removed for model size reduction. Then we combine the
similarity and first-order gradients of key tokens along the query dimension
for token importance estimation and remove redundant key and value tokens to
further reduce the inference complexity. Comprehensive experiments on Occluded
DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals. For
example, our proposed pruning strategy on ViT-Base enjoys 29.4% FLOPs savings
with a 0.2% drop on Rank-1 and a 0.4% improvement on mAP.
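Read literally, the abstract suggests two scores: a per-head entropy summed over the whole attention map, and a per-token importance combining attention similarity with first-order gradients. The sketch below is a minimal, hypothetical PyTorch rendering of that reading; the function names and the elementwise product used to combine similarity and gradients are our assumptions, not the authors' released code.

```python
import torch

def head_entropy_scores(attn: torch.Tensor) -> torch.Tensor:
    """Score each head by the total entropy of its attention map.

    attn: (num_heads, num_queries, num_keys), rows softmax-normalized.
    Per the abstract, entropy is taken along the key dimension and summed
    over the whole map; heads with HIGH scores are pruned away.
    """
    eps = 1e-12
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, queries)
    return entropy.sum(dim=-1)                          # (heads,)

def token_importance(attn: torch.Tensor, attn_grad: torch.Tensor) -> torch.Tensor:
    """Estimate key/value token importance along the query dimension.

    attn_grad is the gradient of the loss w.r.t. the attention map.
    Combining 'similarity' (the attention mass a key token receives) with
    first-order gradients via an elementwise product is an assumption.
    """
    contribution = (attn * attn_grad).abs()  # first-order score per (q, k) link
    return contribution.sum(dim=(0, 1))      # (num_keys,): low scores get pruned
```

Under this reading, pruning a head means removing its query/key/value projection slices (a genuine parameter reduction), while low-importance key and value tokens are dropped before the attention product to cut inference FLOPs.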
Related papers
- Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation [59.1067331268383]
Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce the total number of processed tokens. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency.
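As a rough illustration of what such plug-and-play token aggregation does, here is a toy similarity-based merge; the even/odd pairing and plain averaging are simplifications (real methods such as ToMe use bipartite soft matching):

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(x: torch.Tensor, num_merge: int) -> torch.Tensor:
    """Merge the num_merge most similar token pairs (toy illustration).

    x: (num_tokens, dim) tokens from one ViT layer. Tokens are paired
    even/odd by position, the most similar pairs are averaged, and the
    rest pass through unchanged, so downstream layers see fewer tokens.
    """
    a, b = x[0::2], x[1::2]
    n = min(len(a), len(b))
    sim = F.cosine_similarity(a[:n], b[:n], dim=-1)  # per-pair similarity
    merge_idx = sim.topk(num_merge).indices          # most redundant pairs
    keep = torch.ones(n, dtype=torch.bool)
    keep[merge_idx] = False
    merged = 0.5 * (a[merge_idx] + b[merge_idx])     # average each merged pair
    return torch.cat([a[:n][keep], b[:n][keep], merged, a[n:]], dim=0)
```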
arXiv Detail & Related papers (2025-08-05T12:40:55Z) - PRISM: Distributed Inference for Foundation Models at Edge [73.54372283220444]
PRISM is a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices. We evaluate PRISM on ViT, BERT, and GPT-2 across diverse datasets.
arXiv Detail & Related papers (2025-07-16T11:25:03Z) - ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference [0.41942958779358674]
Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. We introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference based on input difficulty. ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K.
arXiv Detail & Related papers (2025-07-14T20:54:41Z) - BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce Binarized Early Exit Transformer (BEExformer), the first-ever selective-learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE). BAT employs a differentiable second-order approximation to the sign function, enabling gradients that capture both the sign and magnitude of the weights. The EE mechanism hinges on a fractional reduction in entropy among intermediate transformer blocks, with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
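A differentiable second-order sign approximation is commonly implemented as a custom autograd function; the piecewise-quadratic surrogate below is the Bi-Real-Net-style curve, and whether BEExformer uses this exact curve is an assumption based on the abstract:

```python
import torch

class ApproxSign(torch.autograd.Function):
    """Binarize in the forward pass; backpropagate through a
    piecewise-quadratic (second-order) approximation of sign(x)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of the quadratic surrogate: 2+2x on [-1,0), 2-2x on [0,1),
        # and 0 outside [-1, 1] (handled by the clamp).
        grad = torch.where(x < 0, 2 + 2 * x, 2 - 2 * x).clamp(min=0)
        return grad_out * grad
```

During binarization-aware training, `ApproxSign.apply(w)` would stand in for the weights of a linear layer, so both the sign and the magnitude of `w` still receive gradient signal.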
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process any size of high-resolution input without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z) - Size Lowerbounds for Deep Operator Networks [0.27195102129094995]
We establish a data-dependent lower bound on the size of DeepONets required for them to be able to reduce empirical error on noisy data.
We demonstrate that, at a fixed model size, leveraging an increase in this common output dimension to obtain a monotonic lowering of training error may require the training data to scale at least quadratically with that dimension.
arXiv Detail & Related papers (2023-08-11T18:26:09Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has proven very effective.
However, customized algorithms, e.g., GreenMIM, should be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
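A minimal version of this idea, clustering keys with plain k-means and aggregating values by the same assignment (ClusTR's actual clustering procedure is more sophisticated than this sketch), might look like:

```python
import torch

def cluster_kv(k: torch.Tensor, v: torch.Tensor, num_clusters: int, iters: int = 5):
    """Shrink the key/value sequence via k-means-style clustering (toy sketch).

    k, v: (num_tokens, dim). Returns (num_clusters, dim) keys and values, so
    self-attention afterwards costs O(queries * num_clusters) instead of
    O(queries * num_tokens).
    """
    centroids = k[torch.randperm(k.size(0))[:num_clusters]].clone()  # init from keys
    for _ in range(iters):
        assign = torch.cdist(k, centroids).argmin(dim=1)  # nearest centroid per key
        for c in range(num_clusters):
            members = assign == c
            if members.any():                             # skip empty clusters
                centroids[c] = k[members].mean(dim=0)
    # Aggregate values with the final key assignment.
    v_out = torch.stack([v[assign == c].mean(dim=0) if (assign == c).any()
                         else torch.zeros_like(v[0])
                         for c in range(num_clusters)])
    return centroids, v_out
```

Attention then runs against the aggregated keys and values, trading exactness for a lower token count.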
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Indirect-Instant Attention Optimization for Crowd Counting in Dense
Scenes [3.8950254639440094]
We propose an Indirect-Instant Attention Optimization (IIAO) module based on SoftMax-Attention.
The special transformation yields relatively coarse features, and the predictive fallibility of regions varies with the crowd density distribution.
We tailor the Regional Correlation Loss (RCLoss) to retrieve continuous error-prone regions and smooth spatial information.
arXiv Detail & Related papers (2022-06-12T03:29:50Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
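One generic way to picture this, offered as an ACT-style sketch rather than AdaViT's exact formulation: each token accumulates a halting score across layers and is dropped once it crosses a threshold, so later layers process fewer tokens.

```python
import torch

def drop_halted_tokens(tokens: torch.Tensor, halting: torch.Tensor,
                       layer_scores: torch.Tensor, threshold: float = 0.99):
    """Accumulate per-token halting scores and drop halted tokens (sketch).

    tokens: (num_tokens, dim); halting, layer_scores: (num_tokens,).
    Called once per layer; the surviving tokens feed the next layer.
    """
    halting = halting + layer_scores   # accumulate halting mass across layers
    active = halting < threshold       # tokens that keep being processed
    return tokens[active], halting[active]
```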
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - OH-Former: Omni-Relational High-Order Transformer for Person
Re-Identification [30.023365814501137]
We propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for person re-identification (ReID).
The experimental results of our model are promising, showing state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.
arXiv Detail & Related papers (2021-09-23T06:11:38Z) - Is 2D Heatmap Representation Even Necessary for Human Pose Estimation? [44.313782042852246]
We propose a Simple yet promising Disentangled Representation for keypoint coordinates (SimDR).
In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement.
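A minimal sketch of such a disentangled head, assuming a pooled backbone feature and treating each coordinate as an independent 1-D classification (the input shape and layer sizes are placeholders, not SimDR's exact architecture):

```python
import torch
import torch.nn as nn

class DisentangledKeypointHead(nn.Module):
    """Predict keypoint x and y as two independent 1-D classifications
    instead of a single 2-D heatmap (a minimal reading of SimDR)."""

    def __init__(self, feat_dim: int, width: int, height: int, num_joints: int):
        super().__init__()
        self.num_joints = num_joints
        self.to_x = nn.Linear(feat_dim, num_joints * width)   # per-joint x logits
        self.to_y = nn.Linear(feat_dim, num_joints * height)  # per-joint y logits

    def forward(self, feat: torch.Tensor):
        # feat: (batch, feat_dim) pooled backbone feature
        b = feat.size(0)
        x_logits = self.to_x(feat).view(b, self.num_joints, -1)
        y_logits = self.to_y(feat).view(b, self.num_joints, -1)
        return x_logits, y_logits  # decode with argmax(-1) at inference
```

Training supervises the two logit sets with per-axis targets; at inference the two argmaxes give coordinates directly, with no 2-D heatmap upsampling or refinement.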
arXiv Detail & Related papers (2021-07-07T16:20:12Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z) - Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation.
We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component.
Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and the state-of-the-art results on the MPII datasets.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.