Related papers: Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images

Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images

URL: http://arxiv.org/abs/2310.11960v4
Date: Thu, 18 Sep 2025 07:14:08 GMT
Title: Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images
Authors: Yanming Kang, Giang Tran, Hans De Sterck,
Abstract summary: We introduce Fast Multipole Attention (FMA), a divide-and-conquer mechanism for self-attention inspired by n-body physics.<n>FMA reduces the time and memory complexity of self-attention from $mathcalOleft(n2right)$ to $mathcalO(n log n)$ while preserving full-context interactions.<n>We have developed both 1D and 2D implementations of FMA for language and vision tasks, respectively.
Score: 0.818198392834469
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a divide-and-conquer mechanism for self-attention inspired by the Fast Multipole Method from n-body physics. FMA reduces the time and memory complexity of self-attention from $\mathcal{O}\left(n^2\right)$ to $\mathcal{O}(n \log n)$ and $\mathcal{O}(n)$ while preserving full-context interactions. FMA contains a learned hierarchy with $\mathcal{O}(\log n)$ levels of resolution. In this hierarchy, nearby tokens interact at full resolution, while distant tokens engage through progressively coarser, learned basis functions. We have developed both 1D and 2D implementations of FMA for language and vision tasks, respectively. On autoregressive and bidirectional language modeling benchmarks, the 1D variant either matches or outperforms leading efficient attention baselines with substantially lower memory use. With linear complexity, the 2D variant demonstrates superior performance over strong vision transformer baselines in classification and semantic segmentation tasks. Our results confirm that the multilevel attention implemented by FMA allows Transformer-based models to scale to much longer sequences and higher-resolution inputs without loss in accuracy. This provides a principled, physics-inspired approach for developing scalable neural networks suitable for language, vision, and multimodal tasks. Our code will be available at https://github.com/epoch98/FMA.

Related papers

H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-$-recovering framework, called Hierarchical Hourglass Tokenizer (H$_2$OT)<n>Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics [42.41787036246253]
We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion.<n>We show that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude.
arXiv Detail & Related papers (2025-07-03T16:05:26Z)
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.<n>The number of vision tokens increases quadratically as the image resolutions, leading to huge computational costs.<n>We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time. By improving the efficiency of gradient, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction [0.0]
Self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. Our model incorporates several innovative components and exhibits excellent properties, such as introducing local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters.
arXiv Detail & Related papers (2024-01-31T15:57:21Z)
Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. We propose Multi-Modal Mutual Attention ($mathrmM3Att$) and Multi-Modal Mutual Decoder ($mathrmM3Dec$) that better fuse information from the two input modalities.
arXiv Detail & Related papers (2023-05-24T16:26:05Z)
Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for $underlinetextbfS$tate sunderlinetextbfP$ace. We augment a SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton. We show that a Transformer with $O(log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$. We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
Progressive Multi-stage Interactive Training in Mobile Network for Fine-grained Recognition [8.727216421226814]
We propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI) First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases. Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages. Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
arXiv Detail & Related papers (2021-12-08T10:50:03Z)
Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts. In this way, Fastformer can achieve effective context modeling with linear complexity.
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE) Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n2$ connections. We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.