On the Power of Convolution Augmented Transformer
- URL: http://arxiv.org/abs/2407.05591v1
- Date: Mon, 8 Jul 2024 04:08:35 GMT
- Title: On the Power of Convolution Augmented Transformer
- Authors: Mingchen Li, Xuechen Zhang, Yixiao Huang, Samet Oymak
- Abstract summary: We study the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks.
CAT incorporates convolutional filters into the K/Q/V embeddings of an attention layer.
We show that the locality of the convolution synergizes with the global view of the attention.
- Score: 30.46405043231576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer architecture has catalyzed revolutionary advances in language modeling. However, recent architectural recipes, such as state-space models, have bridged the performance gap. Motivated by this, we examine the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks. CAT incorporates convolutional filters into the K/Q/V embeddings of an attention layer. Through CAT, we show that the locality of the convolution synergizes with the global view of the attention. Unlike comparable architectures, such as Mamba or the standard transformer, CAT can provably solve the associative recall (AR) and copying tasks using a single layer while also enjoying guaranteed length generalization. We also establish computational tradeoffs between convolution and attention by characterizing how convolution can mitigate the need for full attention by summarizing the context window and creating salient summary tokens to attend to. Evaluations on real datasets corroborate our findings and demonstrate that CAT and its variations indeed enhance language modeling performance.
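To make the construction concrete, here is a minimal PyTorch sketch of one way to realize convolution-augmented attention: short causal depthwise convolutions applied to the query/key/value streams before standard attention. The module name, kernel size, and the depthwise/causal choices are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvAugmentedAttention(nn.Module):
    """Single-head attention whose Q/K/V streams are first passed through
    short causal depthwise convolutions (a sketch of the CAT idea)."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # One depthwise conv per stream; the left padding is trimmed on the
        # right below so position t only sees tokens <= t (causality).
        self.conv_q = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.conv_k = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.conv_v = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        T = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Conv1d expects (batch, channels, seq); trim the extra right outputs.
        q = self.conv_q(q.transpose(1, 2))[..., :T].transpose(1, 2)
        k = self.conv_k(k.transpose(1, 2))[..., :T].transpose(1, 2)
        v = self.conv_v(v.transpose(1, 2))[..., :T].transpose(1, 2)
        attn = (q @ k.transpose(1, 2)) / q.size(-1) ** 0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn = attn.masked_fill(mask, float("-inf")).softmax(dim=-1)
        return self.out(attn @ v)
```

The convolution gives each token a local summary of its recent neighbors before the global attention step, which is the locality/globality synergy the abstract describes.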
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic processing methods from computer vision.
This insight, which can be adapted to various attention-related models, suggests that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
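A hedged PyTorch sketch of the idea: the pre-softmax attention-score matrix is treated as a single-channel image and refined with a small residual 2D convolution. The function name and the single-channel, 3x3, residual design are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def conv_processed_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                             score_conv: nn.Conv2d) -> torch.Tensor:
    # q, k, v: (batch, seq, dim); the scores form a (batch, 1, seq, seq) "image".
    scores = (q @ k.transpose(1, 2)) / q.size(-1) ** 0.5
    scores = scores + score_conv(scores.unsqueeze(1)).squeeze(1)  # residual 2D conv
    return scores.softmax(dim=-1) @ v

score_conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
q = k = v = torch.randn(2, 16, 64)
out = conv_processed_attention(q, k, v, score_conv)  # (2, 16, 64)
```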
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks through the powerful global-context capability of their token mixers.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
- Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention [44.67973028541842]
This paper introduces a novel approach to enhance the capabilities of Large Language Models (LLMs) in processing and understanding extensive text sequences.
We propose a new model architecture, referred to as Zebra, which efficiently manages the quadratic time and memory complexity issues associated with full attention.
Our model, akin to a zebra's alternating stripes, balances local and global attention layers, significantly reducing computational requirements and memory consumption.
arXiv Detail & Related papers (2023-12-14T02:45:31Z)
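A minimal sketch of the layerwise grouping: a schedule that alternates local (windowed) and global (full) attention layers, plus the causal banded mask a local layer would use. The 3:1 local-to-global ratio and the window size are illustrative assumptions.

```python
import torch

def layer_schedule(num_layers: int, global_every: int = 4) -> list[str]:
    """Assign 'global' or 'local' attention to each layer, zebra-style."""
    return ["global" if i % global_every == 0 else "local"
            for i in range(num_layers)]

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: each token sees itself and the
    previous window - 1 tokens (a causal banded mask)."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]
    return (rel >= 0) & (rel < window)

print(layer_schedule(8))  # ['global', 'local', 'local', 'local', 'global', ...]
print(local_attention_mask(6, 3).int())
```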
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and incurs negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
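A heavily simplified sketch of the mechanism: a small head predicts a per-window affine transform (initialized to the identity) that warps the window's default sampling grid, and the window contents are resampled accordingly. For brevity this sketch samples only within the provided crop, whereas the paper's module can reach outside the default window; the affine (rather than fully projective) parameterization is also our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowWarp(nn.Module):
    """Predict an affine transform per window and resample its contents."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 6)     # 2x3 affine parameters per window
        nn.init.zeros_(self.head.weight)  # start from the identity transform,
        self.head.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])  # i.e. plain windows

    def forward(self, win_desc: torch.Tensor, win_crops: torch.Tensor) -> torch.Tensor:
        # win_desc:  (num_windows, dim) pooled descriptor per window
        # win_crops: (num_windows, dim, win, win) default window contents
        theta = self.head(win_desc).view(-1, 2, 3)
        grid = F.affine_grid(theta, win_crops.shape, align_corners=False)
        return F.grid_sample(win_crops, grid, align_corners=False)

warp = WindowWarp(dim=32)
out = warp(torch.randn(4, 32), torch.randn(4, 32, 7, 7))  # (4, 32, 7, 7)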
- Cross Aggregation Transformer for Image Restoration [48.390140041131886]
Recently, the Transformer architecture has been introduced into image restoration to replace convolutional neural networks (CNNs), with surprising results.
To address the above issue, we propose a new image restoration model, Cross Aggregation Transformer (CAT)
The core of our CAT is the Rectangle-Window Self-Attention (Rwin-SA), which applies horizontal and vertical rectangle-window attention in different heads in parallel to expand the attention area and aggregate features across different windows.
Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism, which incorporates the inductive bias of CNNs (e.g., translation invariance) into the network.
arXiv Detail & Related papers (2022-11-24T15:09:33Z)
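A small sketch of the rectangle-window partition: attention heads are split into two groups, one given wide (horizontal) windows and the other tall (vertical) windows by transposing the window shape. The 2x8 window size is an illustrative assumption.

```python
import torch

def rect_window_partition(x: torch.Tensor, h_win: int, w_win: int) -> torch.Tensor:
    """x: (B, H, W, C) -> (num_windows*B, h_win*w_win, C).
    Assumes H % h_win == 0 and W % w_win == 0."""
    B, H, W, C = x.shape
    x = x.view(B, H // h_win, h_win, W // w_win, w_win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, h_win * w_win, C)

x = torch.randn(1, 16, 16, 32)
horizontal = rect_window_partition(x, 2, 8)  # head group 1: wide windows
vertical = rect_window_partition(x, 8, 2)    # head group 2: tall windows
```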
- Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We pair this learned aggregation layer with a simple patch-based convolutional network parameterized by just two quantities (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z)
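A minimal sketch of the aggregation idea: a single learned query token attends over the flattened convolutional feature map, replacing global average pooling with learned attention pooling. The single-head design and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pool a conv feature map into one vector via a learned query."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) from any convolutional backbone
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        q = self.query.expand(feat.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (B, 1, C)
        return pooled.squeeze(1)

pool = AttentionPool(dim=64)
vec = pool(torch.randn(2, 64, 8, 8))  # (2, 64)
```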
- Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps, together with multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z)
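As a sketch of the raw ingredient CATs operates on, the following computes a correlation (cost) map between two images' feature maps as normalized dot products; CATs then refines such maps with transformer-based aggregation. Shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def correlation_map(src: torch.Tensor, trg: torch.Tensor) -> torch.Tensor:
    """src, trg: (B, C, H, W) -> cost volume (B, H*W_src, H*W_trg)."""
    s = F.normalize(src.flatten(2), dim=1)  # (B, C, HW), unit-norm per location
    t = F.normalize(trg.flatten(2), dim=1)
    return torch.einsum('bcs,bct->bst', s, t)

cost = correlation_map(torch.randn(2, 128, 16, 16), torch.randn(2, 128, 16, 16))
# cost: (2, 256, 256); each entry is the cosine similarity of a location pair.
```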