HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer
- URL: http://arxiv.org/abs/2512.20120v1
- Date: Tue, 23 Dec 2025 07:23:16 GMT
- Title: HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer
- Authors: Mohammad Helal Uddin, Liam Seymour, Sabur Baidya
- Abstract summary: We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput.
- Score: 3.652580364273503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) deliver state-of-the-art accuracy, but their quadratic attention cost and redundant computations severely hinder deployment on latency- and resource-constrained platforms. Existing pruning approaches treat either tokens or heads in isolation, relying on heuristics or first-order signals, which often sacrifice accuracy or fail to generalize across inputs. We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers, which to the best of our knowledge is the first unified, second-order, input-adaptive framework for ViT optimization. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss budgets. This dual-view sensitivity reveals an important structural insight: token pruning dominates computational savings, while head pruning provides fine-grained redundancy removal, and their combination achieves a superior trade-off. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput, while consistently matching or even surpassing baseline accuracy after fine-tuning, for example a 4.7 percent accuracy recovery at 40 percent token pruning. Beyond theoretical benchmarks, we deploy HEART-ViT on edge devices such as the NVIDIA Jetson AGX Orin, demonstrating that our reductions in FLOPs and latency translate directly into real-world gains in inference speed and energy efficiency. HEART-ViT bridges the gap between theory and practice, delivering the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient.
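The core mechanism named in the abstract, scoring each token by a curvature-weighted sensitivity obtained from Hessian-vector products, can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it models token removal as zeroing the token embedding, uses a per-token second-order Taylor estimate of the resulting loss change, computes the Hessian-vector product with Pearlmutter's autograd trick, and relies on a hypothetical `forward_from_tokens` helper; head sensitivities and the explicit loss budget are omitted.

```python
# Minimal sketch (not the authors' code): curvature-weighted token sensitivity
# via a Hessian-vector product, assuming a ViT-style model with a `patch_embed`
# module and a hypothetical `forward_from_tokens(tokens)` helper.
import torch
import torch.nn.functional as F

def token_sensitivity(model, images, labels):
    """Return a (batch, num_tokens) tensor of curvature-weighted scores.

    Removing token i is modeled as zeroing its embedding x_i (Delta = -x_i),
    so the loss change is approximated by the second-order Taylor term
        | -g_i . x_i + 0.5 * x_i . (H x)_i |,
    where g is the gradient and H x is a single Hessian-vector product.
    """
    tokens = model.patch_embed(images)                   # (B, N, D) patch embeddings
    tokens = tokens.detach().requires_grad_(True)

    logits = model.forward_from_tokens(tokens)           # assumed helper
    loss = F.cross_entropy(logits, labels)

    # First-order gradient w.r.t. token embeddings; keep the graph for the HVP.
    grad = torch.autograd.grad(loss, tokens, create_graph=True)[0]

    # Pearlmutter trick: H v = d(g . v)/d tokens, with v treated as a constant.
    v = tokens.detach()
    hvp = torch.autograd.grad((grad * v).sum(), tokens)[0]

    first_order = -(grad * tokens).sum(dim=-1)           # -g_i . x_i  (Delta = -x_i)
    second_order = 0.5 * (tokens * hvp).sum(dim=-1)      # 0.5 * x_i . (H x)_i
    return (first_order + second_order).abs()            # higher = more sensitive
```

Under this sketch, tokens whose scores fall below an input-dependent threshold could be dropped until a FLOPs or loss budget is met, and an analogous score could be computed per attention head; the exact budgeting and head-gating procedure is described in the paper itself.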
Related papers
- ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT [14.21482208417138]
Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs.
We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components.
arXiv Detail & Related papers (2026-02-17T16:52:13Z) - Understanding vision transformer robustness through the lens of out-of-distribution detection [59.72757235382676]
Quantization reduces memory and inference costs at the risk of performance loss.
We investigate the behaviour of quantized small variants of popular vision transformers (DeiT, DeiT3, and ViT) on common out-of-distribution (OOD) datasets.
arXiv Detail & Related papers (2026-02-01T22:00:59Z) - TReX- Reusing Vision Transformer's Attention for Efficient Xbar-based Computing [12.583079680322156]
We propose TReX, an attention-reuse-driven ViT optimization framework.
We find that TReX achieves 2.3x (2.19x) EDAP reduction and 1.86x (1.79x) TOPS/mm2 improvement with a 1% accuracy drop for the DeiT-S (LV-ViT-S) ViT models.
On NLP tasks such as CoLA, TReX leads to 2% higher non-ideal accuracy compared to baseline at 1.6x lower EDAP.
arXiv Detail & Related papers (2024-08-22T21:51:38Z) - HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [102.4965532024391]
Hybrid deep models combining Vision Transformers (ViT) and Convolutional Neural Networks (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades prevalent four-stage ViT to five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448×448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224×224 inputs.
arXiv Detail & Related papers (2024-03-18T17:34:29Z) - Sub-token ViT Embedding via Stochastic Resonance Transformers [51.12001699637727]
Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch.
We propose a training-free method inspired by "stochastic resonance".
The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization.
arXiv Detail & Related papers (2023-10-06T01:53:27Z) - Bi-ViT: Pushing the Limit of Vision Transformer Quantization [38.24456467950003]
Vision transformers (ViTs) quantization offers a promising prospect to facilitate deploying large pre-trained networks on resource-limited devices.
We introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses.
We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework.
arXiv Detail & Related papers (2023-05-21T05:24:43Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness towards common corruptions.
We demonstrate that overlapping patch embeddings and a convolutional Feed-Forward Network (FFN) boost robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
arXiv Detail & Related papers (2022-04-26T08:22:34Z) - SepViT: Separable Vision Transformer [20.403430632658946]
Vision Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices.
We draw lessons from depthwise separable convolution and follow its design principle to build an efficient Transformer backbone, i.e., the Separable Vision Transformer, abbreviated as SepViT.
SepViT helps to carry out the local-global information interaction within and among the windows in sequential order via a depthwise separable self-attention.
arXiv Detail & Related papers (2022-03-29T09:20:01Z) - SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be applied to vanilla Transformers with both flattened and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-12-27T20:15:25Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)