CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers
- URL: http://arxiv.org/abs/2504.06704v1
- Date: Wed, 09 Apr 2025 09:08:26 GMT
- Title: CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers
- Authors: Yoshihiro Yamada
- Abstract summary: We introduce Circular-convolutional ATtention (CAT) to reduce complexity without sacrificing representational power. CAT achieves O(N log N) computations, requires fewer learnable parameters by streamlining fully-connected layers, and introduces no heavier operations. Grounded in an engineering-isomorphism framework, CAT's design offers practical efficiency and ease of implementation.
- Score: 0.3626013617212666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have driven remarkable breakthroughs in natural language processing and computer vision, yet their standard attention mechanism still imposes O(N^2) complexity, hindering scalability to longer sequences. We introduce Circular-convolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular convolutions to reduce complexity without sacrificing representational power. CAT achieves O(N log N) computations, requires fewer learnable parameters by streamlining fully-connected layers, and introduces no heavier operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations on large-scale benchmarks such as ImageNet-1k and WikiText-103. Grounded in an engineering-isomorphism framework, CAT's design not only offers practical efficiency and ease of implementation but also provides insights to guide the development of next-generation, high-performance Transformer architectures. Finally, our ablation studies highlight the key conditions underlying CAT's success, shedding light on broader principles for scalable attention mechanisms.
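To make the mechanism concrete, the sketch below shows how a circular convolution along the sequence axis can be evaluated with the FFT in O(N log N), as the abstract describes. The kernel construction (a per-head, per-position weight projected from the input and normalized with a softmax) and all class and parameter names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class CircularConvAttention(nn.Module):
    """Hypothetical sketch: replace the O(N^2) attention product with a
    circular convolution over the sequence axis, computed via FFT in
    O(N log N). The exact kernel construction used by CAT is not given
    in the abstract; here the kernel is projected from the input
    (an assumption made for illustration)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_v = nn.Linear(dim, dim, bias=False)         # value projection
        self.to_kernel = nn.Linear(dim, heads, bias=False)  # one mixing weight per head and position
        self.proj = nn.Linear(dim, dim, bias=False)         # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        h = self.heads
        v = self.to_v(x).view(b, n, h, d // h)              # (b, n, h, d_head)
        k = torch.softmax(self.to_kernel(x), dim=1)         # (b, n, h), normalized over positions

        # Circular convolution along the sequence axis via FFT:
        # out[t] = sum_s k[s] * v[(t - s) mod n]
        v_f = torch.fft.rfft(v, n=n, dim=1)                 # (b, n//2+1, h, d_head)
        k_f = torch.fft.rfft(k, n=n, dim=1).unsqueeze(-1)   # (b, n//2+1, h, 1)
        out = torch.fft.irfft(v_f * k_f, n=n, dim=1)        # (b, n, h, d_head)

        return self.proj(out.reshape(b, n, d))


# Usage: mixes a length-1024 sequence in O(N log N) rather than O(N^2).
x = torch.randn(2, 1024, 256)
y = CircularConvAttention(dim=256)(x)
print(y.shape)  # torch.Size([2, 1024, 256])
```

The FFT turns the length-N circular convolution into an elementwise product in the frequency domain, which is where the O(N log N) cost comes from; this style of sequence mixing also needs fewer dense projections than standard Q/K/V attention, echoing the abstract's claim of streamlined fully-connected layers.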
Related papers
- BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation [30.912818564963512]
DETRIS is a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation. Our simple yet efficient approach greatly surpasses state-of-the-art methods with only 0.9% to 1.8% of backbone parameters updated.
arXiv Detail & Related papers (2025-01-15T05:00:03Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - On the Power of Convolution Augmented Transformer [30.46405043231576]
We study the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks.
CAT incorporates convolutional filters in the K/Q/V embeddings of an attention layer.
We show that the locality of the convolution synergizes with the global view of the attention.
arXiv Detail & Related papers (2024-07-08T04:08:35Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions [101.08706223326928]
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal{O}(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
arXiv Detail & Related papers (2023-10-28T18:40:03Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - CATs++: Boosting Cost Aggregation with Convolutions and Transformers [31.22435282922934]
We introduce Cost Aggregation with Transformers (CATs) to tackle this by exploring global consensus among the initial correlation maps.
Also, to alleviate some of the limitations that CATs may face, i.e., high computational costs induced by the use of a standard transformer, we propose CATs++.
Our proposed methods outperform the previous state-of-the-art methods by large margins, setting a new state-of-the-art for all the benchmarks.
arXiv Detail & Related papers (2022-02-14T15:54:58Z) - Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)