Hierarchical Shift Mixing -- Beyond Dense Attention in Transformers
- URL: http://arxiv.org/abs/2601.22852v1
- Date: Fri, 30 Jan 2026 11:23:14 GMT
- Title: Hierarchical Shift Mixing -- Beyond Dense Attention in Transformers
- Authors: Robert Forchheimer
- Abstract summary: We introduce HSM, a framework for token mixing that distributes pairwise token interactions across Transformer layers. HSM enables linear-time complexity while remaining agnostic to the specific mixing function. We show that even simple HSM variants achieve performance close to softmax attention.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the introduction of the Transformer architecture for large language models, the softmax-based attention layer has faced increasing scrutiny due to its quadratic-time computational complexity. Attempts have been made to replace it with less complex methods, at the cost of reduced performance in most cases. We introduce Hierarchical Shift Mixing (HSM), a general framework for token mixing that distributes pairwise token interactions across Transformer layers rather than computing them densely within each layer. HSM enables linear-time complexity while remaining agnostic to the specific mixing function. We show that even simple HSM variants achieve performance close to softmax attention, and that hybrid architectures combining HSM with softmax attention can outperform a GPT-style Transformer baseline while reducing computational cost during both training and inference.
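The abstract does not spell out the specific mixing function, so the following is only a minimal sketch of how a shift-based, layer-distributed token mixer could look, assuming a power-of-two shift schedule and a simple gated combination; the class name ShiftMixLayer, the gating, and the schedule are illustrative assumptions rather than the authors' implementation.
```python
# Hedged sketch: each layer mixes every token with one earlier token at a fixed
# offset, so one layer costs O(N); stacking layers with shifts 1, 2, 4, ... lets
# information from any position reach any later position after O(log N) layers.
import torch
import torch.nn as nn


class ShiftMixLayer(nn.Module):
    def __init__(self, d_model: int, shift: int):
        super().__init__()
        self.shift = shift
        self.gate = nn.Linear(2 * d_model, d_model)  # assumed mixing function
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the causal shift pads the front with zeros
        shifted = torch.zeros_like(x)
        if self.shift < x.size(1):
            shifted[:, self.shift:, :] = x[:, :-self.shift, :]
        g = torch.sigmoid(self.gate(torch.cat([x, shifted], dim=-1)))
        return x + self.proj(g * shifted)  # residual update, linear in seq_len


if __name__ == "__main__":
    seq_len, d_model = 128, 64
    # Shifts grow geometrically so the effective receptive field spans the sequence.
    mixer = nn.Sequential(*[ShiftMixLayer(d_model, 2 ** i)
                            for i in range(seq_len.bit_length())])
    print(mixer(torch.randn(2, seq_len, d_model)).shape)  # torch.Size([2, 128, 64])
```
A hybrid model in the spirit of the abstract would interleave such layers with occasional softmax-attention layers, keeping the dense layers rare enough that overall cost stays well below quadratic.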
Related papers
- From Complex Dynamics to DynFormer: Rethinking Transformers for PDEs [6.873342825786888]
Transformer-based neural operators have emerged as powerful data-driven alternatives. We propose DynFormer, a novel dynamics-informed neural operator. We show that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-03T15:45:09Z) - Sparse Multi-Modal Transformer with Masking for Alzheimer's Disease Classification [1.9336815376402718]
Transformer-based multi-modal intelligent systems often suffer from high computational and energy costs due to dense self-attention. This paper presents SMMT, a sparse multi-modal transformer architecture designed to improve efficiency and robustness.
arXiv Detail & Related papers (2025-12-16T15:24:57Z) - Apriel-H1: Towards Efficient Enterprise Reasoning Models [6.630534140883356]
The Apriel-H1 family of hybrid LLMs combines transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA.
arXiv Detail & Related papers (2025-11-04T15:17:43Z) - Fast attention mechanisms: a tale of parallelism [52.7657529272906]
We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms.
arXiv Detail & Related papers (2025-09-10T20:59:44Z) - Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z) - Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately (a generic sketch of the mediator idea appears after this list).
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z) - Hierarchical Separable Video Transformer for Snapshot Compressive Imaging [46.23615648331571]
Hierarchical Separable Video Transformer (HiSViT) is a reconstruction architecture without temporal aggregation.
HiSViT is built by multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN).
Our method outperforms previous methods by $>0.5$ with comparable or fewer parameters and complexity.
arXiv Detail & Related papers (2024-07-16T17:35:59Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
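The mediator-token entry above describes routing attention through a small auxiliary set of tokens. The sketch below shows a generic version of that idea, where queries attend to m learned mediators and the mediators attend to the keys, giving O(N·m) cost instead of O(N²); the mediator count, initialization, and two-stage softmax are assumptions for illustration, not that paper's actual design.
```python
# Hedged sketch of attention routed through a small set of mediator tokens.
import torch
import torch.nn as nn


class MediatorAttention(nn.Module):
    def __init__(self, d_model: int, num_mediators: int = 16):
        super().__init__()
        # Learned mediator tokens shared across the batch (assumed design choice).
        self.mediators = nn.Parameter(torch.randn(num_mediators, d_model) * d_model ** -0.5)
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        med = self.mediators.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, m, d)
        # Stage 1: mediators gather information from keys/values -> (B, m, d)
        gathered = torch.softmax(med @ k.transpose(1, 2) * self.scale, dim=-1) @ v
        # Stage 2: queries read from the m mediator summaries -> (B, N, d)
        scores = torch.softmax(q @ med.transpose(1, 2) * self.scale, dim=-1)
        return scores @ gathered


if __name__ == "__main__":
    attn = MediatorAttention(d_model=64, num_mediators=16)
    print(attn(torch.randn(2, 256, 64)).shape)  # torch.Size([2, 256, 64])
```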