Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers
- URL: http://arxiv.org/abs/2507.19175v1
- Date: Fri, 25 Jul 2025 11:31:17 GMT
- Title: Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers
- Authors: Yuki Igaue, Hiroaki Aizawa,
- Abstract summary: We propose a patch pruning strategy that evaluates the importance of each patch based on the variance of attention weights across multiple attention heads.<n>This approach is inspired by the design of multi-head self-attention, which aims to capture diverse attention patterns across different subspaces of feature representations.
- Score: 0.7673339435080445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head self-attention is a distinctive feature extraction mechanism of vision transformers that computes pairwise relationships among all input patches, contributing significantly to their high performance. However, it is known to incur a quadratic computational complexity with respect to the number of patches. One promising approach to address this issue is patch pruning, which improves computational efficiency by identifying and removing redundant patches. In this work, we propose a patch pruning strategy that evaluates the importance of each patch based on the variance of attention weights across multiple attention heads. This approach is inspired by the design of multi-head self-attention, which aims to capture diverse attention patterns across different subspaces of feature representations. The proposed method can be easily applied during both training and inference, and achieves improved throughput while maintaining classification accuracy in scenarios such as fine-tuning with pre-trained models. In addition, we also found that using robust statistical measures, such as the median absolute deviation in place of variance, to assess patch importance can similarly lead to strong performance. Furthermore, by introducing overlapping patch embeddings, our method achieves better performance with comparable throughput to conventional approaches that utilize all patches.
Related papers
- Focus What Matters: Matchability-Based Reweighting for Local Feature Matching [6.361840891399624]
We propose a novel attention reweighting mechanism that simultaneously incorporates a learnable bias term into the attention logits.<n>Experiments conducted on three benchmark datasets validate the effectiveness of our method.
arXiv Detail & Related papers (2025-05-04T15:50:28Z) - Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method.<n>It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach.<n>Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Variance-Reducing Couplings for Random Features [57.73648780299374]
Random features (RFs) are a popular technique to scale up kernel methods in machine learning.
We find couplings to improve RFs defined on both Euclidean and discrete input spaces.
We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm.
arXiv Detail & Related papers (2024-05-26T12:25:09Z) - Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z) - MVPatch: More Vivid Patch for Adversarial Camouflaged Attacks on Object Detectors in the Physical World [7.1343035828597685]
We introduce generalization theory into the context of Adversarial Patches (APs)
We propose a Dual-Perception-Based Framework (DPBF) to generate the More Vivid Patch (MVPatch), which enhances transferability, stealthiness, and practicality.
MVPatch achieves superior transferability and a natural appearance in both digital and physical domains, underscoring its effectiveness and stealthiness.
arXiv Detail & Related papers (2023-12-29T01:52:22Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully
Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - Harnessing Perceptual Adversarial Patches for Crowd Counting [92.79051296850405]
Crowd counting is vulnerable to adversarial examples in the physical world.
This paper proposes the Perceptual Adrial Patch (PAP) generation framework to learn the shared perceptual features between models.
arXiv Detail & Related papers (2021-09-16T13:51:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.