DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
- URL: http://arxiv.org/abs/2203.01178v2
- Date: Thu, 3 Mar 2022 09:55:56 GMT
- Title: DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
- Authors: Carmelo Scribano, Giorgia Franchini, Marco Prato and Marko Bertogna
- Abstract summary: An intrinsic limitation of the Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive set of experiments shows that our method uses less memory for the same performance, while also drastically reducing inference time.
- Score: 4.622165486890318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since their introduction, Transformer architectures have emerged as
the dominant architectures for both natural language processing and, more
recently, computer vision applications. An intrinsic limitation of this family
of "fully-attentive" architectures arises from the computation of the
dot-product attention, which grows in both memory consumption and number of
operations as $O(n^2)$, where $n$ denotes the input sequence length, thus
limiting the applications that require modeling very long sequences. Several
approaches have been proposed so far in the literature to mitigate this issue,
with varying degrees of success. Our idea takes inspiration from the world of
lossy data compression (such as the JPEG algorithm) to derive an approximation
of the attention module by leveraging the properties of the Discrete Cosine
Transform. An extensive set of experiments shows that our method uses less
memory for the same performance, while also drastically reducing inference
time. This makes it particularly suitable for real-time applications on embedded
platforms. Moreover, we believe that the results of our research might serve as
a starting point for a broader family of deep neural models with reduced memory
footprint. The implementation will be made publicly available at
https://github.com/cscribano/DCT-Former-Public
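The abstract does not detail exactly how the DCT enters the attention computation, so the snippet below is only a minimal sketch of one plausible reading, not the paper's exact design: compress the key and value sequences by keeping their first $m$ low-frequency DCT coefficients, then run standard scaled dot-product attention against the compressed sequence, reducing the cost from $O(n^2)$ to $O(nm)$. The function name and the choice of which tensors to compress are illustrative assumptions; NumPy/SciPy are used here only for brevity.

```python
# Illustrative sketch only (assumed reading of the method, not the released code):
# approximate self-attention by compressing K and V along the sequence axis with
# a truncated DCT, keeping the m lowest-frequency coefficients.
import numpy as np
from scipy.fft import dct

def dct_compressed_attention(Q, K, V, m):
    """Q, K, V: (n, d) arrays; m: number of DCT coefficients kept (m << n)."""
    n, d = Q.shape
    # DCT-II along the sequence dimension; keep the m lowest frequencies.
    K_c = dct(K, type=2, norm="ortho", axis=0)[:m]   # (m, d)
    V_c = dct(V, type=2, norm="ortho", axis=0)[:m]   # (m, d)
    # Scaled dot-product attention on the compressed sequence:
    # memory and operations now scale as O(n * m) instead of O(n^2).
    scores = Q @ K_c.T / np.sqrt(d)                  # (n, m)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V_c                             # (n, d)

# Example: n = 1024 tokens, d = 64 channels, compressed to m = 128 coefficients.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
print(dct_compressed_attention(Q, K, V, m=128).shape)  # (1024, 64)
```

Keeping only low-frequency coefficients mirrors the lossy-compression intuition from JPEG cited in the abstract: most of the signal energy concentrates in a few DCT coefficients, so the discarded high frequencies cost little accuracy.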
Related papers
- Scalable Cross-Entropy Loss for Sequential Recommendations with Large Item Catalogs [4.165917157093442]
This paper introduces a novel Scalable Cross-Entropy (SCE) loss function in the sequential learning setup.
It approximates the CE loss for datasets with large item catalogs, improving both time efficiency and memory usage without compromising recommendation quality.
Experimental results on multiple datasets demonstrate the effectiveness of SCE in reducing peak memory usage by a factor of up to 100 compared to the alternatives.
arXiv Detail & Related papers (2024-09-27T13:17:59Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have become some of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention during the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation [6.576180048533476]
PaSeR (Parsimonious with Reinforcement Learning) is a non-cascading, cost-aware learning pipeline.
We show that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models.
We introduce a new metric, IoU/GigaFlop, to evaluate the balance between cost and performance.
arXiv Detail & Related papers (2024-02-19T01:17:52Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- Topology-aware Embedding Memory for Continual Learning on Expanding Networks [63.35819388164267]
We present a framework to tackle the memory explosion problem using memory replay techniques.
PDGNNs with Topology-aware Embedding Memory (TEM) significantly outperform state-of-the-art techniques.
arXiv Detail & Related papers (2024-01-24T03:03:17Z)
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with an attention mechanism, we can effectively boost performance without huge computational overhead.
We demonstrate our approach on various image- and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
- Scalable Adaptive Computation for Iterative Generation [13.339848496653465]
Recurrent Interface Networks (RINs) are an attention-based architecture that decouples its core computation from the dimensionality of the data.
RINs focus the bulk of computation on a set of latent tokens, using cross-attention to read and write information between latent and data tokens.
RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance.
arXiv Detail & Related papers (2022-12-22T18:55:45Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
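The last entry only states that randomized dynamic programming (RDP) scales classical DP-based inference to very large latent spaces. The sketch below is a toy illustration of that general idea on the HMM forward recursion, subsampling latent states with importance weights at each step; it is an assumed simplification for exposition, not the authors' actual estimators.

```python
# Toy sketch (assumption, not the paper's RDP construction): randomize the HMM
# forward recursion by importance-sampling a subset of latent states per step.
import numpy as np

rng = np.random.default_rng(0)
N, m, L = 2000, 100, 20                       # states, sampled states, sequence length

# Random HMM parameters (rows normalized to valid distributions).
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)   # transitions
E = rng.random((N, 4)); E /= E.sum(axis=1, keepdims=True)   # emissions over 4 symbols
pi = np.full(N, 1.0 / N)
obs = rng.integers(0, 4, size=L)

def forward_exact(pi, A, E, obs):
    """Exact forward algorithm: O(L * N^2) time."""
    alpha = pi * E[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * E[:, x]
    return alpha.sum()                        # likelihood p(obs)

def forward_randomized(pi, A, E, obs, m, rng):
    """Randomized variant: at each step, propagate only m states sampled in
    proportion to the current alpha, reweighting so each update stays an
    unbiased estimate. Cost per step drops from O(N^2) to O(m * N)."""
    alpha = pi * E[:, obs[0]]
    for x in obs[1:]:
        s = alpha.sum()
        idx = rng.choice(N, size=m, p=alpha / s)          # sampled states
        alpha = (s / m) * A[idx].sum(axis=0) * E[:, x]    # Monte Carlo estimate
    return alpha.sum()

print(forward_exact(pi, A, E, obs))
print(forward_randomized(pi, A, E, obs, m, rng))          # close, up to sampling noise
```

Written in an autodiff framework, the same subsampled update would remain differentiable with respect to the model parameters, which is broadly the kind of compatibility the entry highlights.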