MSWA: Refining Local Attention with Multi-Scale Window Attention
- URL: http://arxiv.org/abs/2501.01039v1
- Date: Thu, 02 Jan 2025 03:41:32 GMT
- Title: MSWA: Refining Local Attention with Multi-Scale Window Attention
- Authors: Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum
- Abstract summary: Sliding window attention (SWA) addresses the quadratic cost of standard self-attention by restricting the attention range to a fixed-size local context window.
We propose Multi-Scale Window Attention (MSWA), which applies diverse window sizes across heads and layers in the Transformer.
It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances.
- Score: 14.481768894355522
- License:
- Abstract: Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
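To make the mechanism concrete, below is a minimal sketch of multi-scale sliding-window attention for a decoder-style Transformer. The allocation rule (window sizes doubling across heads and scaled up for deeper layers) and the helper names `window_sizes_for_layer` and `mswa_attention` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def window_sizes_for_layer(layer_idx: int, num_layers: int, num_heads: int,
                           base_window: int = 64):
    """Hypothetical allocation rule (an assumption, not the paper's exact recipe):
    heads within a layer get geometrically increasing windows, and the whole
    allocation grows from shallow to deep layers."""
    depth_scale = 1 + layer_idx * 2 // max(num_layers - 1, 1)   # 1 for shallow, up to 3 for deep
    return [base_window * (2 ** h) * depth_scale for h in range(num_heads)]

def mswa_attention(q, k, v, window_sizes):
    """q, k, v: (batch, heads, seq_len, head_dim); one causal sliding window per head."""
    _, _, seq_len, _ = q.shape
    idx = torch.arange(seq_len)
    offset = idx.unsqueeze(1) - idx.unsqueeze(0)                # offset[i, j] = i - j
    masks = torch.stack([(offset >= 0) & (offset < w)           # causal + per-head window
                         for w in window_sizes])                # (heads, seq_len, seq_len)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=masks.unsqueeze(0))

# Usage sketch: layer 0 of a 12-layer model with 4 heads gets windows [64, 128, 256, 512].
q = k = v = torch.randn(1, 4, 256, 32)
out = mswa_attention(q, k, v, window_sizes_for_layer(0, num_layers=12, num_heads=4))
print(out.shape)  # torch.Size([1, 4, 256, 32])
```

Setting every entry of `window_sizes` to the same value recovers uniform SWA; MSWA only redistributes the per-head window budget, so the overall attention cost can stay comparable to SWA while mixing short- and long-range heads.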
Related papers
- AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z)
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance.
LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
- HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution [70.52256118833583]
We present a strategy to convert transformer-based SR networks to hierarchical transformers (HiT-SR).
Specifically, we first replace the commonly used fixed small windows with expanding hierarchical windows to aggregate features at different scales.
Considering the intensive computation required for large windows, we further design a spatial-channel correlation method whose complexity is linear in window size.
arXiv Detail & Related papers (2024-07-08T12:42:10Z)
- Multi-Scale Representations by Varying Window Attention for Semantic Segmentation [10.549932900057462]
A novel multi-scale learner, varying window attention (VWA), is presented to address the cost of enlarging local attention windows for multi-scale representation learning.
We propose a simple but professional re-scaling strategy to zero the extra induced cost without compromising performance.
We also introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation.
arXiv Detail & Related papers (2024-04-25T12:35:27Z)
- Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners [17.747301325787618]
This work proposes a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module.
MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations.
arXiv Detail & Related papers (2023-06-01T11:20:59Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We augment the bottom layer of SPADE with an SSM and employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
- VSA: Learning Varied-Size Window Attention in Vision Transformers [76.35955924137986]
We propose Varied-Size Window Attention (VSA) to learn adaptive window configurations from data.
Based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window.
arXiv Detail & Related papers (2022-04-18T17:56:07Z)
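The VSA entry above predicts a varied-size window from the tokens of each default window. Here is a minimal, hedged sketch of that window-regression step; the `WindowRegression` class, the mean-pooling summary, and the scale/offset parameterization are illustrative assumptions, and the follow-on step of sampling keys and values from the predicted window (e.g. via grid sampling) is omitted.

```python
import torch
import torch.nn as nn

class WindowRegression(nn.Module):
    """Hypothetical sketch of a VSA-style regression head: summarize the tokens of
    each default window and predict the extent and offset of a target window."""
    def __init__(self, dim: int, default_window: int):
        super().__init__()
        self.default_window = default_window
        self.to_scale_offset = nn.Linear(dim, 4)      # (scale_h, scale_w, offset_h, offset_w)

    def forward(self, window_tokens):
        # window_tokens: (num_windows, default_window**2, dim)
        pooled = window_tokens.mean(dim=1)             # summarize each default window
        scale_offset = self.to_scale_offset(pooled)
        scale = 1.0 + scale_offset[:, :2]              # target size relative to the default
        offset = scale_offset[:, 2:]                   # shift of the window center
        target_extent = scale * self.default_window    # predicted window height/width
        return target_extent, offset

# Usage sketch: 64 default 7x7 windows with 96-dim tokens.
reg = WindowRegression(dim=96, default_window=7)
extent, offset = reg(torch.randn(64, 49, 96))
print(extent.shape, offset.shape)  # torch.Size([64, 2]) torch.Size([64, 2])
```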
- MixFormer: Mixing Features across Windows and Dimensions [68.86393312123168]
Local-window self-attention performs well in vision tasks but suffers from a limited receptive field and weak modeling capability.
This is mainly because it performs self-attention within non-overlapped windows and shares weights on the channel dimension.
We combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive fields.
arXiv Detail & Related papers (2022-04-06T03:13:50Z)
- Beyond Fixation: Dynamic Window Visual Transformer [19.982593341611796]
We propose a novel method named Dynamic Window Vision Transformer (DW-ViT).
In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention.
DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformers.
arXiv Detail & Related papers (2022-03-24T05:38:07Z)
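Several of the vision papers above (DW-ViT, MW-MAE, DilateFormer) share the idea of giving different attention heads or head groups different window sizes. Below is a minimal sketch of the DW-ViT-style variant for 2D feature maps, where channels are split into head groups and each group runs window self-attention at its own window size; the partition helper, the chosen group and window values, and the use of the input itself as query, key, and value are illustrative assumptions rather than any paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def window_self_attention(x, window, num_heads):
    """x: (B, H, W, C). Partition into non-overlapping window x window patches and
    run multi-head self-attention inside each patch (one DW-ViT-style head group)."""
    B, H, W, C = x.shape
    assert H % window == 0 and W % window == 0, "pad the feature map first"
    head_dim = C // num_heads
    # group tokens per window: (B * num_windows, window * window, C)
    xw = (x.view(B, H // window, window, W // window, window, C)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(-1, window * window, C))
    # split channels into heads; q = k = v = the raw window tokens for brevity
    t = xw.view(-1, window * window, num_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(t, t, t).transpose(1, 2)
    out = out.reshape(-1, window * window, C)
    # undo the window partition back to (B, H, W, C)
    return (out.view(B, H // window, W // window, window, window, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C))

# DW-ViT-style usage: split channels into head groups, give each group its own window size.
x = torch.randn(2, 56, 56, 96)
groups = x.chunk(3, dim=-1)       # three head groups of 32 channels each
windows = [7, 14, 28]             # one window size per group (assumed values)
y = torch.cat([window_self_attention(g, w, num_heads=1)
               for g, w in zip(groups, windows)], dim=-1)
print(y.shape)  # torch.Size([2, 56, 56, 96])
```

MSWA applies the same principle along the token axis of a language model, additionally growing the window allocation with layer depth.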
This list is automatically generated from the titles and abstracts of the papers on this site.