Related papers: How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy

URL: http://arxiv.org/abs/2511.06044v1
Date: Sat, 08 Nov 2025 15:34:15 GMT
Title: How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy
Authors: Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang
Abstract summary: We put forward Random Batch Attention (RBA), a linear self-attention mechanism with theoretical support for its ability to maintain expressivity. RBA has several significant strengths; for example, it can be implemented in parallel along a new dimension, which saves considerable memory.
Score: 39.666337901651865
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information, and its expressivity has been proved to be powerful. Nevertheless, its quadratic complexity restricts its practicality. Although several studies have proposed sparse attention mechanisms, they lack theoretical analysis of how expressivity is affected when complexity is reduced. In this paper, we put forward Random Batch Attention (RBA), a linear self-attention mechanism with theoretical support for its ability to maintain expressivity. Random Batch Attention has several significant strengths: (1) It has linear time complexity. In addition, it can be implemented in parallel along a new dimension, which saves considerable memory. (2) It can improve most existing models by replacing their attention mechanisms, including many previously improved attention mechanisms. (3) It has a theoretical explanation of convergence, as it derives from Random Batch Methods in computational mathematics. Experiments on large graphs demonstrate the advantages mentioned above. In addition, the theoretical modeling of the self-attention mechanism is a new tool for future research on attention-mechanism analysis.
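The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind Random Batch Methods applied to attention, not the authors' implementation. It assumes that nodes are shuffled into small random batches and that each node attends only to the nodes in its own batch. The function names (`random_batch_attention`, `softmax`, `dot`) and the `batch_size` parameter are hypothetical and chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_batch_attention(q, k, v, batch_size, rng):
    """Softmax attention restricted to random batches (illustrative only).

    q, k, v are lists of n row vectors. The n indices are shuffled and cut
    into consecutive batches of at most `batch_size` indices. Each row
    attends only to rows in its own batch, so roughly n * batch_size scores
    are computed instead of n * n.
    """
    n = len(q)
    scale = 1.0 / math.sqrt(len(q[0]))
    order = list(range(n))
    rng.shuffle(order)  # assumption: a fresh uniform random partition per call

    out = [None] * n
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        # Batches are independent of one another, so this loop could be
        # run in parallel.
        for i in batch:
            weights = softmax([scale * dot(q[i], k[j]) for j in batch])
            out[i] = [
                sum(w * v[j][c] for w, j in zip(weights, batch))
                for c in range(len(v[0]))
            ]
    return out
```

Under these assumptions, calling `random_batch_attention(x, x, x, batch_size=2, rng=random.Random(0))` on a list `x` of n feature vectors returns n output vectors. Setting `batch_size` to n recovers ordinary full softmax attention, while a fixed small `batch_size` keeps the number of score evaluations proportional to n.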

Related papers

Poly-attention: a general scheme for higher-order self-attention [16.719964872886315]
We define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens. We give new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix.
arXiv Detail & Related papers (2026-02-02T18:24:53Z)
Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z)
Are queries and keys always relevant? A case study on Transformer wave functions [0.0]
The dot product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers. We explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions.
arXiv Detail & Related papers (2024-05-29T08:32:37Z)
Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
The self-attention mechanism prevails in modern machine learning. Two arguments have connected attention localization to model performance. We show that a small eigenspectrum variance leads attention to be localized.
arXiv Detail & Related papers (2024-02-03T09:35:53Z)
Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches. We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z)
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones. We find that without any input-dependent attention, all models achieve competitive performance. We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
Understanding Interlocking Dynamics of Cooperative Rationalization [90.6863969334526]
Selective rationalization explains the prediction of complex neural networks by finding a small subset of the input that is sufficient to predict the neural model output. We reveal a major problem with such a cooperative rationalization paradigm: model interlocking. We propose a new rationalization framework, called A2R, which introduces a third component into the architecture, a predictor driven by soft attention as opposed to selection.
arXiv Detail & Related papers (2021-10-26T17:39:18Z)
Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention [7.967230034960396]
We evaluate whether various active-memory mechanisms could replace self-attention in a Transformer. Experiments suggest that active-memory alone achieves comparable results to the self-attention mechanism for language modelling. For some specific algorithmic tasks, active-memory mechanisms alone outperform both self-attention and a combination of the two.
arXiv Detail & Related papers (2019-12-27T02:01:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
