Attention Map Guided Transformer Pruning for Edge Device
- URL: http://arxiv.org/abs/2304.01452v1
- Date: Tue, 4 Apr 2023 01:51:53 GMT
- Title: Attention Map Guided Transformer Pruning for Edge Device
- Authors: Junzhu Mao, Yazhou Yao, Zeren Sun, Xingguo Huang, Fumin Shen and
Heng-Tao Shen
- Abstract summary: Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
- Score: 98.42178656762114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to its significant capability of modeling long-range dependencies, vision
transformer (ViT) has achieved promising success in both holistic and occluded
person re-identification (Re-ID) tasks. However, the inherent problems of
transformers such as the huge computational cost and memory footprint are still
two unsolved issues that will block the deployment of ViT based person Re-ID
models on resource-limited edge devices. Our goal is to reduce both the
inference complexity and model size without sacrificing the comparable accuracy
on person Re-ID, especially for tasks with occlusion. To this end, we propose a
novel attention map guided (AMG) transformer pruning method, which removes both
redundant tokens and heads with the guidance of the attention map in a
hardware-friendly way. We first calculate the entropy in the key dimension and
sum it up for the whole map, and the corresponding head parameters of maps with
high entropy will be removed for model size reduction. Then we combine the
similarity and first-order gradients of key tokens along the query dimension
for token importance estimation and remove redundant key and value tokens to
further reduce the inference complexity. Comprehensive experiments on Occluded
DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals. For
example, our proposed pruning strategy on ViT-Base enjoys 29.4% FLOPs savings
with a 0.2% drop on Rank-1 and a 0.4% improvement on mAP.
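Read literally, the abstract suggests two scores: a per-head entropy summed over the whole attention map, and a per-token importance combining attention similarity with first-order gradients. The sketch below is a minimal, hypothetical PyTorch rendering of that reading; the function names and the elementwise product used to combine similarity and gradients are our assumptions, not the authors' released code.

```python
import torch

def head_entropy_scores(attn: torch.Tensor) -> torch.Tensor:
    """Score each head by the total entropy of its attention map.

    attn: (num_heads, num_queries, num_keys), rows softmax-normalized.
    Per the abstract, entropy is taken along the key dimension and summed
    over the whole map; heads with HIGH scores are pruned away.
    """
    eps = 1e-12
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, queries)
    return entropy.sum(dim=-1)                          # (heads,)

def token_importance(attn: torch.Tensor, attn_grad: torch.Tensor) -> torch.Tensor:
    """Estimate key/value token importance along the query dimension.

    attn_grad is the gradient of the loss w.r.t. the attention map.
    Combining 'similarity' (the attention mass a key token receives) with
    first-order gradients via an elementwise product is an assumption.
    """
    contribution = (attn * attn_grad).abs()  # first-order score per (q, k) link
    return contribution.sum(dim=(0, 1))      # (num_keys,): low scores get pruned
```

Under this reading, pruning a head means removing its query/key/value projection slices (a genuine parameter reduction), while low-importance key and value tokens are dropped before the attention product to cut inference FLOPs.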
Related papers
- Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation [59.1067331268383]
Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce the total number of processed tokens. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency.
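As a rough illustration of what such plug-and-play token aggregation does, here is a toy similarity-based merge; the even/odd pairing and plain averaging are simplifications (real methods such as ToMe use bipartite soft matching):

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(x: torch.Tensor, num_merge: int) -> torch.Tensor:
    """Merge the num_merge most similar token pairs (toy illustration).

    x: (num_tokens, dim) tokens from one ViT layer. Tokens are paired
    even/odd by position, the most similar pairs are averaged, and the
    rest pass through unchanged, so downstream layers see fewer tokens.
    """
    a, b = x[0::2], x[1::2]
    n = min(len(a), len(b))
    sim = F.cosine_similarity(a[:n], b[:n], dim=-1)  # per-pair similarity
    merge_idx = sim.topk(num_merge).indices          # most redundant pairs
    keep = torch.ones(n, dtype=torch.bool)
    keep[merge_idx] = False
    merged = 0.5 * (a[merge_idx] + b[merge_idx])     # average each merged pair
    return torch.cat([a[:n][keep], b[:n][keep], merged, a[n:]], dim=0)
```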
arXiv Detail & Related papers (2025-08-05T12:40:55Z) - PRISM: Distributed Inference for Foundation Models at Edge [73.54372283220444]
PRISM is a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices. We evaluate PRISM on ViT, BERT, and GPT-2 across diverse datasets.
arXiv Detail & Related papers (2025-07-16T11:25:03Z) - ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference [0.41942958779358674]
Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. We introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference based on input difficulty. ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K.
arXiv Detail & Related papers (2025-07-14T20:54:41Z) - BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce Binarized Early Exit Transformer (BEExformer), the first-ever selective-learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE). BAT employs a differentiable second-order approximation to the sign function, enabling gradients that capture both the sign and magnitude of the weights. The EE mechanism hinges on a fractional reduction in entropy among intermediate transformer blocks, with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
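A differentiable second-order sign approximation is commonly implemented as a custom autograd function; the piecewise-quadratic surrogate below is the Bi-Real-Net-style curve, and whether BEExformer uses this exact curve is an assumption based on the abstract:

```python
import torch

class ApproxSign(torch.autograd.Function):
    """Binarize in the forward pass; backpropagate through a
    piecewise-quadratic (second-order) approximation of sign(x)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of the quadratic surrogate: 2+2x on [-1,0), 2-2x on [0,1),
        # and 0 outside [-1, 1] (handled by the clamp).
        grad = torch.where(x < 0, 2 + 2 * x, 2 - 2 * x).clamp(min=0)
        return grad_out * grad
```

During binarization-aware training, `ApproxSign.apply(w)` would stand in for the weights of a linear layer, so both the sign and the magnitude of `w` still receive gradient signal.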
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process any size of high-resolution input without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z) - Size Lowerbounds for Deep Operator Networks [0.27195102129094995]
We establish a data-dependent lower bound on the size of DeepONets required for them to be able to reduce empirical error on noisy data.
We demonstrate that, at a fixed model size, leveraging an increase in this common output dimension to obtain a monotonic lowering of training error may require the training data to scale at least quadratically with that dimension.
arXiv Detail & Related papers (2023-08-11T18:26:09Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has proven very effective.
However, customized algorithms, e.g., GreenMIM, should be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
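A minimal version of this idea, clustering keys with plain k-means and aggregating values by the same assignment (ClusTR's actual clustering procedure is more sophisticated than this sketch), might look like:

```python
import torch

def cluster_kv(k: torch.Tensor, v: torch.Tensor, num_clusters: int, iters: int = 5):
    """Shrink the key/value sequence via k-means-style clustering (toy sketch).

    k, v: (num_tokens, dim). Returns (num_clusters, dim) keys and values, so
    self-attention afterwards costs O(queries * num_clusters) instead of
    O(queries * num_tokens).
    """
    centroids = k[torch.randperm(k.size(0))[:num_clusters]].clone()  # init from keys
    for _ in range(iters):
        assign = torch.cdist(k, centroids).argmin(dim=1)  # nearest centroid per key
        for c in range(num_clusters):
            members = assign == c
            if members.any():                             # skip empty clusters
                centroids[c] = k[members].mean(dim=0)
    # Aggregate values with the final key assignment.
    v_out = torch.stack([v[assign == c].mean(dim=0) if (assign == c).any()
                         else torch.zeros_like(v[0])
                         for c in range(num_clusters)])
    return centroids, v_out
```

Attention then runs against the aggregated keys and values, trading exactness for a lower token count.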
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Indirect-Instant Attention Optimization for Crowd Counting in Dense
Scenes [3.8950254639440094]
We propose an Indirect-Instant Attention Optimization (IIAO) module based on SoftMax-Attention.
The special transformation yields relatively coarse features, and the predictive fallibility of regions varies with the crowd density distribution.
We tailor the Regional Correlation Loss (RCLoss) to retrieve continuous error-prone regions and smooth spatial information.
arXiv Detail & Related papers (2022-06-12T03:29:50Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
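One generic way to picture this, offered as an ACT-style sketch rather than AdaViT's exact formulation: each token accumulates a halting score across layers and is dropped once it crosses a threshold, so later layers process fewer tokens.

```python
import torch

def drop_halted_tokens(tokens: torch.Tensor, halting: torch.Tensor,
                       layer_scores: torch.Tensor, threshold: float = 0.99):
    """Accumulate per-token halting scores and drop halted tokens (sketch).

    tokens: (num_tokens, dim); halting, layer_scores: (num_tokens,).
    Called once per layer; the surviving tokens feed the next layer.
    """
    halting = halting + layer_scores   # accumulate halting mass across layers
    active = halting < threshold       # tokens that keep being processed
    return tokens[active], halting[active]
```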
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - OH-Former: Omni-Relational High-Order Transformer for Person
Re-Identification [30.023365814501137]
We propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for person re-identification (ReID).
The experimental results of our model are promising, showing state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.
arXiv Detail & Related papers (2021-09-23T06:11:38Z) - Is 2D Heatmap Representation Even Necessary for Human Pose Estimation? [44.313782042852246]
We propose a Simple yet promising Disentangled Representation for keypoint coordinates (SimDR).
In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement.
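A minimal sketch of such a disentangled head, assuming a pooled backbone feature and treating each coordinate as an independent 1-D classification (the input shape and layer sizes are placeholders, not SimDR's exact architecture):

```python
import torch
import torch.nn as nn

class DisentangledKeypointHead(nn.Module):
    """Predict keypoint x and y as two independent 1-D classifications
    instead of a single 2-D heatmap (a minimal reading of SimDR)."""

    def __init__(self, feat_dim: int, width: int, height: int, num_joints: int):
        super().__init__()
        self.num_joints = num_joints
        self.to_x = nn.Linear(feat_dim, num_joints * width)   # per-joint x logits
        self.to_y = nn.Linear(feat_dim, num_joints * height)  # per-joint y logits

    def forward(self, feat: torch.Tensor):
        # feat: (batch, feat_dim) pooled backbone feature
        b = feat.size(0)
        x_logits = self.to_x(feat).view(b, self.num_joints, -1)
        y_logits = self.to_y(feat).view(b, self.num_joints, -1)
        return x_logits, y_logits  # decode with argmax(-1) at inference
```

Training supervises the two logit sets with per-axis targets; at inference the two argmaxes give coordinates directly, with no 2-D heatmap upsampling or refinement.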
arXiv Detail & Related papers (2021-07-07T16:20:12Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z) - Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation.
We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component.
Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and the state-of-the-art results on the MPII datasets.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.