Learned Thresholds Token Merging and Pruning for Vision Transformers
- URL: http://arxiv.org/abs/2307.10780v2
- Date: Thu, 17 Aug 2023 11:51:16 GMT
- Title: Learned Thresholds Token Merging and Pruning for Vision Transformers
- Authors: Maxim Bonnaerens, Joni Dambre
- Abstract summary: This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning.
We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task.
- Score: 5.141687309207561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have demonstrated remarkable success in a wide range of
computer vision tasks in recent years. However, their high computational
costs remain a significant barrier to their practical deployment. In
particular, the complexity of transformer models is quadratic with respect to
the number of input tokens. Therefore, techniques that reduce the number of
input tokens that need to be processed have been proposed. This paper
introduces Learned Thresholds token Merging and Pruning (LTMP), a novel
approach that leverages the strengths of both token merging and token pruning.
LTMP uses learned threshold masking modules that dynamically determine which
tokens to merge and which to prune. We demonstrate our approach with extensive
experiments on vision transformers on the ImageNet classification task. Our
results show that LTMP achieves state-of-the-art accuracy across
reduction rates while requiring only a single fine-tuning epoch, which is an
order of magnitude faster than previous methods. Code is available at
https://github.com/Mxbonn/ltmp .
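As a rough illustration of the mechanism the abstract describes, the sketch below shows how a learned threshold can gate which tokens are kept, and how the surviving tokens could then be merged by similarity. It is a minimal PyTorch sketch under assumed design choices (attention-received importance scores, a sigmoid relaxation of the hard threshold, and a greedy cosine-similarity merge); the names `LearnedThresholdMask` and `prune_and_merge` are illustrative and are not taken from the paper or its repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnedThresholdMask(nn.Module):
    """Soft keep-mask: sigmoid(temperature * (score - learned_threshold))."""

    def __init__(self, temperature: float = 10.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.zeros(()))  # learned scalar threshold
        self.temperature = temperature

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (num_tokens,) importance per token.
        # The sigmoid is a differentiable relaxation of `score > threshold`,
        # so the threshold can be trained by backpropagation.
        return torch.sigmoid(self.temperature * (scores - self.threshold))


def prune_and_merge(tokens: torch.Tensor, attn: torch.Tensor,
                    mask_module: LearnedThresholdMask,
                    merge_sim: float = 0.9) -> torch.Tensor:
    """One illustrative reduction step for a single image.

    tokens: (N, D) token embeddings; attn: (heads, N, N) attention weights
    from the preceding self-attention layer.
    """
    scores = attn.mean(dim=(0, 1))        # mean attention each token receives
    keep = mask_module(scores) > 0.5      # hard decision (inference-time view)
    kept = tokens[keep]                   # prune low-importance tokens

    # Greedy merge: average groups of kept tokens whose cosine similarity
    # exceeds merge_sim (standing in for a second, merge-specific threshold).
    normed = F.normalize(kept, dim=-1)
    sim = normed @ normed.T
    merged, used = [], set()
    for i in range(kept.shape[0]):
        if i in used:
            continue
        partners = [j for j in range(i + 1, kept.shape[0])
                    if j not in used and sim[i, j] > merge_sim]
        merged.append(kept[[i] + partners].mean(dim=0))
        used.update(partners)
    return torch.stack(merged)


# Usage with random data standing in for one image's ViT tokens:
tokens = torch.randn(197, 768)                           # 196 patches + CLS
attn = torch.softmax(torch.randn(12, 197, 197), dim=-1)  # 12 attention heads
out = prune_and_merge(tokens, attn, LearnedThresholdMask())
print(out.shape)  # (M, 768) with M <= 197 after pruning and merging
```

Because the sigmoid mask is differentiable with respect to the threshold, such a module can be trained jointly with the backbone in a short fine-tuning run, which is consistent with the single-epoch fine-tuning the abstract reports.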
Related papers
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer-based methods have achieved great success in image inpainting recently.
They regard each pixel as a token, thus suffering from an information loss issue.
We propose a new transformer-based framework called "PUT".
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens.
We empirically demonstrate the advantages of the resulting model, NeuTRENO, over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning [28.180891300826165]
Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers.
We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens.
Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
arXiv Detail & Related papers (2022-10-03T15:49:48Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
In this paper, we propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Learned Token Pruning for Transformers [39.181816379061374]
The Learned Token Pruning (LTP) method reduces redundant tokens as the data passes through the different layers of a transformer.
We extensively test the performance of our approach on multiple GLUE tasks.
Preliminary results show up to 1.4x and 1.9x throughput improvements on a Tesla T4 GPU and an Intel Haswell CPU, respectively.
arXiv Detail & Related papers (2021-07-02T09:00:13Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than the quadratic complexity of full attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.