EcoFormer: Energy-Saving Attention with Linear Complexity
- URL: http://arxiv.org/abs/2209.09004v3
- Date: Mon, 20 Mar 2023 04:49:10 GMT
- Title: EcoFormer: Energy-Saving Attention with Linear Complexity
- Authors: Jing Liu, Zizheng Pan, Haoyu He, Jianfei Cai, Bohan Zhuang
- Abstract summary: Transformer is a transformative framework that models sequential data.
We propose a new binarization paradigm customized to high-dimensional softmax attention.
We show that EcoFormer consistently achieves comparable performance with standard attentions.
- Score: 40.002608785252164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer is a transformative framework that models sequential data and has
achieved remarkable performance on a wide range of tasks, but with high
computational and energy cost. To improve its efficiency, a popular choice is
to compress the models via binarization, which constrains the floating-point
values to binary ones and significantly reduces resource consumption thanks to
cheap bitwise operations. However, existing binarization methods only aim at
minimizing the information loss for the input distribution statistically, while
ignoring the pairwise similarity modeling at the core of the attention. To this
end, we propose a new binarization paradigm customized to high-dimensional
softmax attention via kernelized hashing, called EcoFormer, to map the original
queries and keys into low-dimensional binary codes in Hamming space. The
kernelized hash functions are learned to match the ground-truth similarity
relations extracted from the attention map in a self-supervised way. Based on
the equivalence between the inner product of binary codes and the Hamming
distance as well as the associative property of matrix multiplication, we can
approximate the attention in linear complexity by expressing it as a
dot-product of binary codes. Moreover, the compact binary representations of
queries and keys enable us to replace most of the expensive multiply-accumulate
operations in attention with simple accumulations to save considerable on-chip
energy footprint on edge devices. Extensive experiments on both vision and
language tasks show that EcoFormer consistently achieves performance comparable
to standard attention while consuming far fewer resources. For example, based
on PVTv2-B0 and ImageNet-1K, EcoFormer achieves a 73% on-chip energy
footprint reduction with only a 0.33% performance drop compared to the standard
attention. Code is available at https://github.com/ziplab/EcoFormer.
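To make the linear-complexity trick above concrete, the snippet below is a minimal sketch, not the authors' implementation: a fixed random-projection sign hash stands in for the learned kernelized hash functions, the codes are lifted to {0, 1} so the kernel stays non-negative, and a single head is used for brevity; the function and parameter names (binary_linear_attention, num_bits) are illustrative.
```python
# Sketch of binarized linear attention in the spirit of EcoFormer.
# Assumption: a random sign hash replaces the learned, similarity-matching hash.
import torch

def binary_hash(x, proj):
    # x: (N, d), proj: (d, b) random projection -> b-bit codes in {0, 1}
    return (x @ proj > 0).float()

def binary_linear_attention(q, k, v, num_bits=16, eps=1e-6):
    # q, k: (N, d), v: (N, dv); cost is O(N * b * dv) rather than O(N^2 * dv).
    d = q.shape[-1]
    proj = torch.randn(d, num_bits)      # stand-in for the learned hash functions
    q_code = binary_hash(q, proj)        # (N, b) binary query features
    k_code = binary_hash(k, proj)        # (N, b) binary key features
    # Associativity: form (K_code^T V) and (K_code^T 1) once, reuse for all queries.
    kv = k_code.t() @ v                  # (b, dv); 0/1 codes make this a selective sum
    k_sum = k_code.sum(dim=0)            # (b,)
    out = q_code @ kv                    # (N, dv)
    norm = q_code @ k_sum + eps          # (N,) per-query normalizer
    return out / norm.unsqueeze(-1)

# Toy usage: 4096 tokens, 64-dim head; memory and compute grow linearly in N.
q, k, v = (torch.randn(4096, 64) for _ in range(3))
print(binary_linear_attention(q, k, v).shape)  # torch.Size([4096, 64])
```
Because the key statistics are accumulated once and multiplications involve only 0/1 code bits, most multiply-accumulate operations reduce, in principle, to plain accumulations; realizing the reported energy savings additionally requires dedicated low-bit kernels or hardware support.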
Related papers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that separates the learning of local inductive bias from that of long-range dependencies.
By adopting this decoupled learning scheme and fully exploiting the complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saves 40-60% of the base models' FLOPs.
arXiv Detail & Related papers (2024-05-25T09:37:01Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- BiFormer: Vision Transformer with Bi-Level Routing Attention [26.374724782056557]
We propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness.
Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions.
Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented.
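As a rough illustration of the two-level idea (not BiFormer's implementation), the sketch below operates on a 1D token sequence with a single head, uses mean-pooled region descriptors for the coarse routing step, and treats region_size and topk as illustrative hyperparameters.
```python
# Sketch of bi-level routing attention: coarse region filtering, then fine attention.
import torch

def bilevel_routing_attention(q, k, v, region_size=64, topk=4):
    # q, k, v: (N, d); toy simplification assumes N is divisible by region_size.
    N, d = q.shape
    R = N // region_size
    qr = q.view(R, region_size, d)
    kr = k.view(R, region_size, d)
    vr = v.view(R, region_size, d)
    # Coarse level: region-to-region affinity from mean-pooled descriptors.
    q_desc = qr.mean(dim=1)                        # (R, d)
    k_desc = kr.mean(dim=1)                        # (R, d)
    affinity = q_desc @ k_desc.t()                 # (R, R)
    routed = affinity.topk(topk, dim=-1).indices   # (R, topk) kept regions per query region
    # Fine level: token-to-token attention restricted to the routed regions.
    k_sel = kr[routed].reshape(R, topk * region_size, d)
    v_sel = vr[routed].reshape(R, topk * region_size, d)
    attn = torch.softmax(qr @ k_sel.transpose(1, 2) / d ** 0.5, dim=-1)
    return (attn @ v_sel).reshape(N, d)

# Toy usage: each of 1024 tokens attends to at most topk * region_size = 256 keys.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(bilevel_routing_attention(q, k, v).shape)  # torch.Size([1024, 64])
```
BiFormer itself routes 2D windows on image feature maps; the 1D regions here only show how the coarse filtering step bounds the cost of the fine-grained attention.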
arXiv Detail & Related papers (2023-03-15T17:58:46Z)
- Graph-Collaborated Auto-Encoder Hashing for Multi-view Binary Clustering [11.082316688429641]
We propose a hashing algorithm based on auto-encoders for multi-view binary clustering.
Specifically, we propose a multi-view affinity graphs learning model with low-rank constraint, which can mine the underlying geometric information from multi-view data.
We also design an encoder-decoder paradigm that makes the multiple affinity graphs collaborate, so that a unified binary code can be learned effectively.
arXiv Detail & Related papers (2023-01-06T12:43:13Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite its favorable performance, calculating these pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
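The sketch below is a minimal stand-in for this clustering idea, not ClusTR's implementation: it runs plain k-means on the keys, mean-pools the values per cluster, and lets every query attend to the reduced set of centroids; num_clusters, the k-means routine, and the aggregation rule are all assumptions.
```python
# Sketch of clustering-based sparse attention: attend to cluster summaries of keys/values.
import torch

def clustered_attention(q, k, v, num_clusters=64, iters=5):
    # q, k, v: (N, d); queries attend to num_clusters centroids instead of N keys.
    N, d = q.shape
    centroids = k[torch.randperm(N)[:num_clusters]].clone()    # init from random keys
    for _ in range(iters):                                      # plain k-means on the keys
        assign = torch.cdist(k, centroids).argmin(dim=-1)       # (N,) cluster id per token
        for c in range(num_clusters):
            mask = assign == c
            if mask.any():
                centroids[c] = k[mask].mean(dim=0)
    # Aggregate values with the same assignment, then attend to the reduced set.
    v_agg = torch.zeros(num_clusters, d)
    counts = torch.zeros(num_clusters, 1)
    v_agg.index_add_(0, assign, v)
    counts.index_add_(0, assign, torch.ones(N, 1))
    v_agg = v_agg / counts.clamp(min=1)
    attn = torch.softmax(q @ centroids.t() / d ** 0.5, dim=-1)  # (N, num_clusters)
    return attn @ v_agg                                          # (N, d)

# Toy usage: attention cost scales with num_clusters rather than sequence length.
q, k, v = (torch.randn(2048, 64) for _ in range(3))
print(clustered_attention(q, k, v).shape)  # torch.Size([2048, 64])
```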
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.