MSWA: Refining Local Attention with Multi-Scale Window Attention
- URL: http://arxiv.org/abs/2501.01039v1
- Date: Thu, 02 Jan 2025 03:41:32 GMT
- Title: MSWA: Refining Local Attention with Multi-Scale Window Attention
- Authors: Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum
- Abstract summary: Sliding window attention (SWA) addresses the quadratic cost of standard self-attention by restricting the attention range to a fixed-size local context window.
We propose Multi-Scale Window Attention (MSWA), which applies diverse window sizes across heads and layers in the Transformer.
It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances.
- Score: 14.481768894355522
- License:
- Abstract: Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
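To make the mechanism concrete, below is a minimal sketch of multi-scale sliding-window attention for a decoder-style Transformer. The allocation rule (window sizes doubling across heads and scaled up for deeper layers) and the helper names `window_sizes_for_layer` and `mswa_attention` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def window_sizes_for_layer(layer_idx: int, num_layers: int, num_heads: int,
                           base_window: int = 64):
    """Hypothetical allocation rule (an assumption, not the paper's exact recipe):
    heads within a layer get geometrically increasing windows, and the whole
    allocation grows from shallow to deep layers."""
    depth_scale = 1 + layer_idx * 2 // max(num_layers - 1, 1)   # 1 for shallow, up to 3 for deep
    return [base_window * (2 ** h) * depth_scale for h in range(num_heads)]

def mswa_attention(q, k, v, window_sizes):
    """q, k, v: (batch, heads, seq_len, head_dim); one causal sliding window per head."""
    _, _, seq_len, _ = q.shape
    idx = torch.arange(seq_len)
    offset = idx.unsqueeze(1) - idx.unsqueeze(0)                # offset[i, j] = i - j
    masks = torch.stack([(offset >= 0) & (offset < w)           # causal + per-head window
                         for w in window_sizes])                # (heads, seq_len, seq_len)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=masks.unsqueeze(0))

# Usage sketch: layer 0 of a 12-layer model with 4 heads gets windows [64, 128, 256, 512].
q = k = v = torch.randn(1, 4, 256, 32)
out = mswa_attention(q, k, v, window_sizes_for_layer(0, num_layers=12, num_heads=4))
print(out.shape)  # torch.Size([1, 4, 256, 32])
```

Setting every entry of `window_sizes` to the same value recovers uniform SWA; MSWA only redistributes the per-head window budget, so the overall attention cost can stay comparable to SWA while mixing short- and long-range heads.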
Related papers
- AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z)
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance.
LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
- HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution [70.52256118833583]
We present a strategy to convert transformer-based SR networks to hierarchical transformers (HiT-SR).
Specifically, we first replace the commonly used fixed small windows with expanding hierarchical windows to aggregate features at different scales.
Considering the intensive computation required for large windows, we further design a spatial-channel correlation method whose complexity is linear in window size.
arXiv Detail & Related papers (2024-07-08T12:42:10Z)
- Multi-Scale Representations by Varying Window Attention for Semantic Segmentation [10.549932900057462]
A novel multi-scale learner, varying window attention (VWA), is presented to address the cost of enlarging local attention windows for multi-scale representation learning.
We propose a simple but professional re-scaling strategy to zero the extra induced cost without compromising performance.
We also introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation.
arXiv Detail & Related papers (2024-04-25T12:35:27Z)
- Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners [17.747301325787618]
This work proposes a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module.
MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations.
arXiv Detail & Related papers (2023-06-01T11:20:59Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We augment the bottom layer of SPADE with an SSM and employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
- VSA: Learning Varied-Size Window Attention in Vision Transformers [76.35955924137986]
We propose Varied-Size Window Attention (VSA) to learn adaptive window configurations from data.
Based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window.
arXiv Detail & Related papers (2022-04-18T17:56:07Z)
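The VSA entry above predicts a varied-size window from the tokens of each default window. Here is a minimal, hedged sketch of that window-regression step; the `WindowRegression` class, the mean-pooling summary, and the scale/offset parameterization are illustrative assumptions, and the follow-on step of sampling keys and values from the predicted window (e.g. via grid sampling) is omitted.

```python
import torch
import torch.nn as nn

class WindowRegression(nn.Module):
    """Hypothetical sketch of a VSA-style regression head: summarize the tokens of
    each default window and predict the extent and offset of a target window."""
    def __init__(self, dim: int, default_window: int):
        super().__init__()
        self.default_window = default_window
        self.to_scale_offset = nn.Linear(dim, 4)      # (scale_h, scale_w, offset_h, offset_w)

    def forward(self, window_tokens):
        # window_tokens: (num_windows, default_window**2, dim)
        pooled = window_tokens.mean(dim=1)             # summarize each default window
        scale_offset = self.to_scale_offset(pooled)
        scale = 1.0 + scale_offset[:, :2]              # target size relative to the default
        offset = scale_offset[:, 2:]                   # shift of the window center
        target_extent = scale * self.default_window    # predicted window height/width
        return target_extent, offset

# Usage sketch: 64 default 7x7 windows with 96-dim tokens.
reg = WindowRegression(dim=96, default_window=7)
extent, offset = reg(torch.randn(64, 49, 96))
print(extent.shape, offset.shape)  # torch.Size([64, 2]) torch.Size([64, 2])
```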
- MixFormer: Mixing Features across Windows and Dimensions [68.86393312123168]
Local-window self-attention performs well in vision tasks but suffers from a limited receptive field and weak modeling capability.
This is mainly because it performs self-attention within non-overlapped windows and shares weights on the channel dimension.
We combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive fields.
arXiv Detail & Related papers (2022-04-06T03:13:50Z)
- Beyond Fixation: Dynamic Window Visual Transformer [19.982593341611796]
We propose a novel method named Dynamic Window Vision Transformer (DW-ViT).
In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention.
DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformers.
arXiv Detail & Related papers (2022-03-24T05:38:07Z)
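Several of the vision papers above (DW-ViT, MW-MAE, DilateFormer) share the idea of giving different attention heads or head groups different window sizes. Below is a minimal sketch of the DW-ViT-style variant for 2D feature maps, where channels are split into head groups and each group runs window self-attention at its own window size; the partition helper, the chosen group and window values, and the use of the input itself as query, key, and value are illustrative assumptions rather than any paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def window_self_attention(x, window, num_heads):
    """x: (B, H, W, C). Partition into non-overlapping window x window patches and
    run multi-head self-attention inside each patch (one DW-ViT-style head group)."""
    B, H, W, C = x.shape
    assert H % window == 0 and W % window == 0, "pad the feature map first"
    head_dim = C // num_heads
    # group tokens per window: (B * num_windows, window * window, C)
    xw = (x.view(B, H // window, window, W // window, window, C)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(-1, window * window, C))
    # split channels into heads; q = k = v = the raw window tokens for brevity
    t = xw.view(-1, window * window, num_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(t, t, t).transpose(1, 2)
    out = out.reshape(-1, window * window, C)
    # undo the window partition back to (B, H, W, C)
    return (out.view(B, H // window, W // window, window, window, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C))

# DW-ViT-style usage: split channels into head groups, give each group its own window size.
x = torch.randn(2, 56, 56, 96)
groups = x.chunk(3, dim=-1)       # three head groups of 32 channels each
windows = [7, 14, 28]             # one window size per group (assumed values)
y = torch.cat([window_self_attention(g, w, num_heads=1)
               for g, w in zip(groups, windows)], dim=-1)
print(y.shape)  # torch.Size([2, 56, 56, 96])
```

MSWA applies the same principle along the token axis of a language model, additionally growing the window allocation with layer depth.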
This list is automatically generated from the titles and abstracts of the papers on this site.