Token Pooling in Vision Transformers
- URL: http://arxiv.org/abs/2110.03860v2
- Date: Mon, 11 Oct 2021 15:17:21 GMT
- Title: Token Pooling in Vision Transformers
- Authors: Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu,
Mohammad Rastegari, Oncel Tuzel
- Abstract summary: In vision transformers, self-attention is not the major computational bottleneck; for example, more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, called Token Pooling, which efficiently exploits redundancies in images and in intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over state-of-the-art downsampling methods.
- Score: 37.11990688046186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent success in many applications, the high computational
requirements of vision transformers limit their use in resource-constrained
settings. While many existing methods improve the quadratic complexity of
attention, in most vision transformers self-attention is not the major
computational bottleneck; for example, more than 80% of the computation is
spent on fully-connected layers. To improve the computational complexity of all
layers, we propose a novel token downsampling method, called Token Pooling,
which efficiently exploits redundancies in images and in intermediate token
representations.
We show that, under mild assumptions, softmax-attention acts as a
high-dimensional low-pass (smoothing) filter. Thus, its output contains
redundancy that can be pruned to achieve a better trade-off between the
computational cost and accuracy. Our new technique accurately approximates a
set of tokens by minimizing the reconstruction error caused by downsampling. We
solve this optimization problem via cost-efficient clustering. We rigorously
analyze our method and compare it to prior downsampling methods. Our
experiments show that Token Pooling significantly improves the cost-accuracy
trade-off over state-of-the-art downsampling methods. Token Pooling is a simple and effective operator
that can benefit many architectures. Applied to DeiT, it achieves the same
ImageNet top-1 accuracy using 42% fewer computations.
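
The full text specifies the exact optimization and clustering algorithm; as a rough illustration of the core idea only, the following PyTorch sketch downsamples a token sequence with plain K-means, whose objective is precisely the squared reconstruction error described above. The function name `token_pooling`, the uniform initialization, and the iteration count are assumptions made for illustration, not the paper's implementation.

```python
import torch

def token_pooling(x, num_clusters, iters=10):
    """Downsample a token sequence by clustering (minimal K-means sketch).

    x: (batch, n_tokens, dim) intermediate token representations.
    Returns (batch, num_clusters, dim) pooled tokens. K-means minimizes
    sum_i min_j ||x_i - c_j||^2, i.e. the squared error of reconstructing
    each token by its nearest cluster center -- the reconstruction-error
    objective in the abstract. The paper's own cost-efficient clustering
    differs in its details.
    """
    b, n, d = x.shape
    # Assumption for illustration: initialize centers by uniform subsampling.
    idx = torch.linspace(0, n - 1, num_clusters).long()
    centers = x[:, idx, :].clone()                              # (b, k, d)
    for _ in range(iters):
        # Assign every token to its nearest center.
        assign = torch.cdist(x, centers).argmin(dim=-1)         # (b, n)
        onehot = torch.nn.functional.one_hot(
            assign, num_clusters).to(x.dtype)                   # (b, n, k)
        # Recompute each center as the mean of its assigned tokens
        # (an empty cluster keeps a zero vector in this simple sketch).
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)   # (b, k, 1)
        centers = torch.einsum('bnk,bnd->bkd', onehot, x) / counts
    return centers
```

Inserted between transformer blocks, such an operator shortens the sequence that every subsequent layer processes, so both attention and the dominant fully-connected layers become cheaper, which is why token downsampling can reduce total computation even when attention is not the bottleneck.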
Related papers
- Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199] (2024-03-31)
  Transformer-based methods have achieved great success in image inpainting recently.
  They regard each pixel as a token and thus suffer from an information-loss issue.
  We propose a new transformer-based framework called "PUT".
- PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676] (2023-10-03)
  We propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT).
  PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters.
  It reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on ImageNet.
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684] (2022-08-28)
  We propose a content-based sparse attention method as an alternative to dense self-attention.
  Specifically, we cluster and then aggregate key and value tokens, reducing the total token count; a minimal sketch of this clustered-attention idea appears after this list.
  The resulting clustered-token sequence retains the semantic diversity of the original signal but can be processed at a lower computational cost.
- Learning strides in convolutional neural networks [34.20666933112202] (2022-02-03)
  This work introduces DiffStride, the first downsampling layer with learnable strides.
  Experiments on audio and image classification show the generality and effectiveness of our solution.
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633] (2021-11-27)
  We present a systematic method to reduce the performance degradation and inference complexity of quantized transformers.
  We are the first to achieve comparable accuracy degradation (1%) on fully quantized vision transformers.
- AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling [82.08631594071656] (2021-11-01)
  Pooling layers are essential building blocks of convolutional neural networks (CNNs).
  We propose an adaptive and exponentially weighted pooling method named adaPool.
  We demonstrate how adaPool improves the preservation of detail across a range of tasks, including image and video classification and object detection.
- Refining activation downsampling with SoftPool [74.1840492087968] (2021-01-02)
  Convolutional neural networks (CNNs) use pooling to decrease the size of activation maps.
  We propose SoftPool: a fast and efficient method for exponentially weighted activation downsampling.
  We show that SoftPool retains more information in the reduced activation maps.
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927] (2020-07-14)
  We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
  Our AQD achieves comparable or even better performance than the full-precision counterpart under extremely low-bit schemes.
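
The clustered-attention idea in the ClusTR entry above lends itself to a similar sketch. This is an assumption-laden illustration, not ClusTR's actual algorithm: keys are grouped here with plain K-means, values are aggregated with the same assignments, and queries attend over the centroids. The name `clustered_attention` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, num_clusters, iters=5):
    """Sketch of content-based sparse attention via key/value clustering.

    q, k, v: (batch, n_tokens, dim). Queries attend over num_clusters
    centroids instead of all n keys, so the attention cost drops from
    O(n^2) to O(n * num_clusters).
    """
    b, n, d = k.shape
    # Assumption for illustration: plain K-means on the keys.
    idx = torch.linspace(0, n - 1, num_clusters).long()
    centers = k[:, idx, :].clone()
    for _ in range(iters):
        assign = torch.cdist(k, centers).argmin(dim=-1)            # (b, n)
        onehot = F.one_hot(assign, num_clusters).to(k.dtype)       # (b, n, c)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)      # (b, c, 1)
        centers = torch.einsum('bnc,bnd->bcd', onehot, k) / counts
    # Aggregate values with the same cluster assignments.
    v_pooled = torch.einsum('bnc,bnd->bcd', onehot, v) / counts
    # Dense attention over the much shorter clustered sequence.
    attn = F.softmax(q @ centers.transpose(1, 2) / d ** 0.5, dim=-1)  # (b, n, c)
    return attn @ v_pooled                                            # (b, n, d)
```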