DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
- URL: http://arxiv.org/abs/2203.01178v2
- Date: Thu, 3 Mar 2022 09:55:56 GMT
- Title: DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
- Authors: Carmelo Scribano, Giorgia Franchini, Marco Prato and Marko Bertogna
- Abstract summary: An intrinsic limitation of the Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive set of experiments shows that our method uses less memory for the same performance, while also drastically reducing inference time.
- Score: 4.622165486890318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since their introduction, Transformer architectures have emerged as
the dominant architectures for both natural language processing and, more
recently, computer vision applications. An intrinsic limitation of this family
of "fully-attentive" architectures arises from the computation of the
dot-product attention, which grows in both memory consumption and number of
operations as $O(n^2)$, where $n$ denotes the input sequence length, thus
limiting the applications that require modeling very long sequences. Several
approaches have been proposed so far in the literature to mitigate this issue,
with varying degrees of success. Our idea takes inspiration from the world of
lossy data compression (such as the JPEG algorithm) to derive an approximation
of the attention module by leveraging the properties of the Discrete Cosine
Transform. An extensive set of experiments shows that our method uses less
memory for the same performance, while also drastically reducing inference
time. This makes it particularly suitable for real-time applications on embedded
platforms. Moreover, we believe that the results of our research might serve as
a starting point for a broader family of deep neural models with reduced memory
footprint. The implementation will be made publicly available at
https://github.com/cscribano/DCT-Former-Public
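The abstract does not detail exactly how the DCT enters the attention computation, so the snippet below is only a minimal sketch of one plausible reading, not the paper's exact design: compress the key and value sequences by keeping their first $m$ low-frequency DCT coefficients, then run standard scaled dot-product attention against the compressed sequence, reducing the cost from $O(n^2)$ to $O(nm)$. The function name and the choice of which tensors to compress are illustrative assumptions; NumPy/SciPy are used here only for brevity.

```python
# Illustrative sketch only (assumed reading of the method, not the released code):
# approximate self-attention by compressing K and V along the sequence axis with
# a truncated DCT, keeping the m lowest-frequency coefficients.
import numpy as np
from scipy.fft import dct

def dct_compressed_attention(Q, K, V, m):
    """Q, K, V: (n, d) arrays; m: number of DCT coefficients kept (m << n)."""
    n, d = Q.shape
    # DCT-II along the sequence dimension; keep the m lowest frequencies.
    K_c = dct(K, type=2, norm="ortho", axis=0)[:m]   # (m, d)
    V_c = dct(V, type=2, norm="ortho", axis=0)[:m]   # (m, d)
    # Scaled dot-product attention on the compressed sequence:
    # memory and operations now scale as O(n * m) instead of O(n^2).
    scores = Q @ K_c.T / np.sqrt(d)                  # (n, m)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V_c                             # (n, d)

# Example: n = 1024 tokens, d = 64 channels, compressed to m = 128 coefficients.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
print(dct_compressed_attention(Q, K, V, m=128).shape)  # (1024, 64)
```

Keeping only low-frequency coefficients mirrors the lossy-compression intuition from JPEG cited in the abstract: most of the signal energy concentrates in a few DCT coefficients, so the discarded high frequencies cost little accuracy.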
Related papers
- Scalable Cross-Entropy Loss for Sequential Recommendations with Large Item Catalogs [4.165917157093442]
This paper introduces a novel Scalable Cross-Entropy (SCE) loss function in the sequential learning setup.
It approximates the CE loss for datasets with large item catalogs, improving both time efficiency and memory usage without compromising recommendation quality.
Experimental results on multiple datasets demonstrate the effectiveness of SCE in reducing peak memory usage by a factor of up to 100 compared to the alternatives.
arXiv Detail & Related papers (2024-09-27T13:17:59Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have become some of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention during the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation [6.576180048533476]
PaSeR (Parsimonious with Reinforcement Learning) is a non-cascading, cost-aware learning pipeline.
We show that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models.
We introduce a new metric, IoU/GigaFlop, to evaluate the balance between cost and performance.
arXiv Detail & Related papers (2024-02-19T01:17:52Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- Topology-aware Embedding Memory for Continual Learning on Expanding Networks [63.35819388164267]
We present a framework to tackle the memory explosion problem using memory replay techniques.
PDGNNs with Topology-aware Embedding Memory (TEM) significantly outperform state-of-the-art techniques.
arXiv Detail & Related papers (2024-01-24T03:03:17Z)
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with an attention mechanism, we can effectively boost performance without huge computational overhead.
We demonstrate our approach on various image- and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
- Scalable Adaptive Computation for Iterative Generation [13.339848496653465]
Recurrent Interface Networks (RINs) are an attention-based architecture that decouples its core computation from the dimensionality of the data.
RINs focus the bulk of computation on a set of latent tokens, using cross-attention to read and write information between latent and data tokens.
RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance.
arXiv Detail & Related papers (2022-12-22T18:55:45Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
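The last entry only states that randomized dynamic programming (RDP) scales classical DP-based inference to very large latent spaces. The sketch below is a toy illustration of that general idea on the HMM forward recursion, subsampling latent states with importance weights at each step; it is an assumed simplification for exposition, not the authors' actual estimators.

```python
# Toy sketch (assumption, not the paper's RDP construction): randomize the HMM
# forward recursion by importance-sampling a subset of latent states per step.
import numpy as np

rng = np.random.default_rng(0)
N, m, L = 2000, 100, 20                       # states, sampled states, sequence length

# Random HMM parameters (rows normalized to valid distributions).
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)   # transitions
E = rng.random((N, 4)); E /= E.sum(axis=1, keepdims=True)   # emissions over 4 symbols
pi = np.full(N, 1.0 / N)
obs = rng.integers(0, 4, size=L)

def forward_exact(pi, A, E, obs):
    """Exact forward algorithm: O(L * N^2) time."""
    alpha = pi * E[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * E[:, x]
    return alpha.sum()                        # likelihood p(obs)

def forward_randomized(pi, A, E, obs, m, rng):
    """Randomized variant: at each step, propagate only m states sampled in
    proportion to the current alpha, reweighting so each update stays an
    unbiased estimate. Cost per step drops from O(N^2) to O(m * N)."""
    alpha = pi * E[:, obs[0]]
    for x in obs[1:]:
        s = alpha.sum()
        idx = rng.choice(N, size=m, p=alpha / s)          # sampled states
        alpha = (s / m) * A[idx].sum(axis=0) * E[:, x]    # Monte Carlo estimate
    return alpha.sum()

print(forward_exact(pi, A, E, obs))
print(forward_randomized(pi, A, E, obs, m, rng))          # close, up to sampling noise
```

Written in an autodiff framework, the same subsampled update would remain differentiable with respect to the model parameters, which is broadly the kind of compatibility the entry highlights.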